# Predictions notebook
author: Gonzalo Miranda Cabrera

objective: predict on the submission data and write the predictions to a submission file.

summary:
1. Show the test-set results of all trained models.
2. Load the best models.
3. Prepare the submission data as input for the models.
4. Make predictions.
5. Save the predictions to a file.

In [2]:
import pandas as pd
from joblib import load

## Results on test data - summary table

| Metric | Model         | Random search | Grid search |
|--------|---------------|---------------|-------------|
| MAE    | SVM           | 17.696        | 17.694      |
| MAE    | Random Forest | 18.375        | 18.376      |
| MAE    | XGBoost       | 17.885        | 17.855      |
| MAE    | Ensemble      | **17.658**    | 17.659      |
| MSE    | SVM           | 579.375       | 579.333     |
| MSE    | Random Forest | 561.918       | 562.135     |
| MSE    | XGBoost       | 573.365       | 567.926     |
| MSE    | Ensemble      | 554.118       | **552.983** |
| RMSE   | SVM           | 24.070        | 24.069      |
| RMSE   | Random Forest | 23.704        | 23.709      |
| RMSE   | XGBoost       | 23.945        | 23.831      |
| RMSE   | Ensemble      | 23.539        | **23.515**  |
| R2     | SVM           | 0.440         | 0.440       |
| R2     | Random Forest | 0.457         | 0.456       |
| R2     | XGBoost       | 0.446         | 0.451       |
| R2     | Ensemble      | 0.464         | **0.465**   |

The best result for each metric is shown in bold.

Based on the table, we use the grid search ensemble to predict on submission_data.csv: it has the best MSE, RMSE, and R2, and its MAE is within 0.001 of the best value.
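The selection can also be checked mechanically. The sketch below copies the values from the summary table into a plain dictionary and picks the best entry per metric, treating MAE, MSE, and RMSE as lower-is-better and R2 as higher-is-better. The names `results`, `HIGHER_IS_BETTER`, and `best_per_metric` are illustrative and not part of the notebook.

```python
# Values copied from the summary table: metric -> {(model, search): score}
results = {
    "MAE": {
        ("SVM", "random"): 17.696, ("SVM", "grid"): 17.694,
        ("Random Forest", "random"): 18.375, ("Random Forest", "grid"): 18.376,
        ("XGBoost", "random"): 17.885, ("XGBoost", "grid"): 17.855,
        ("Ensemble", "random"): 17.658, ("Ensemble", "grid"): 17.659,
    },
    "MSE": {
        ("SVM", "random"): 579.375, ("SVM", "grid"): 579.333,
        ("Random Forest", "random"): 561.918, ("Random Forest", "grid"): 562.135,
        ("XGBoost", "random"): 573.365, ("XGBoost", "grid"): 567.926,
        ("Ensemble", "random"): 554.118, ("Ensemble", "grid"): 552.983,
    },
    "RMSE": {
        ("SVM", "random"): 24.070, ("SVM", "grid"): 24.069,
        ("Random Forest", "random"): 23.704, ("Random Forest", "grid"): 23.709,
        ("XGBoost", "random"): 23.945, ("XGBoost", "grid"): 23.831,
        ("Ensemble", "random"): 23.539, ("Ensemble", "grid"): 23.515,
    },
    "R2": {
        ("SVM", "random"): 0.440, ("SVM", "grid"): 0.440,
        ("Random Forest", "random"): 0.457, ("Random Forest", "grid"): 0.456,
        ("XGBoost", "random"): 0.446, ("XGBoost", "grid"): 0.451,
        ("Ensemble", "random"): 0.464, ("Ensemble", "grid"): 0.465,
    },
}

# R2 is a goodness-of-fit score; the other three metrics are errors.
HIGHER_IS_BETTER = {"R2"}


def best_per_metric(results):
    """Return {metric: (model, search)} with the best score for each metric."""
    best = {}
    for metric, scores in results.items():
        pick = max if metric in HIGHER_IS_BETTER else min
        best[metric] = pick(scores, key=scores.get)
    return best


best = best_per_metric(results)
for metric, (model, search) in best.items():
    print(f"{metric}: {model} ({search} search)")
```

The ensemble wins on every metric; only for MAE does the random search variant edge out the grid search one.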

In [5]:
# Load models
svm_grid_search_best_model = load("grid_search_results/svm_best_model")
rfr_grid_search_best_model = load("grid_search_results/rfr_best_model")
xgb_grid_search_best_model = load("grid_search_results/xgb_best_model")

In [3]:
# Load submission data
submission_data = pd.read_csv("submission_data.csv")

# Save order_id for later creation of submission file
order_ids = submission_data.pop("order_id")

# Drop total_minutes column (NaN values)
submission_data.drop(columns=["total_minutes"], inplace=True)

# Show submission data
submission_data.head()


Unnamed: 0,on_demand,is_Friday,is_Saturday,is_Sunday,distance,found_rate,picking_speed,accepted_rate,rating,seniority_41dc7c9e385c4d2b6c1f7836973951bf,...,sqrd_unique_products,logn_unique_products,distance_div_units,distance_div_kgs,distance_div_unique_products,distance_div_order_size,unique_products_div_picking_speed,units_div_picking_speed,kgs_div_picking_speed,order_size_div_picking_speed
0,0,0,1,0,1.061235,0.8854,1.58,0.84,4.52,0,...,5184.0,4.276666,0.008101,0.077655,0.014739,0.007439,45.56962,82.278481,8.016456,90.294937
1,0,0,1,0,3.68324,0.8193,1.78,0.96,4.8,0,...,144.0,2.484907,0.153468,1.416631,0.306937,0.149725,6.741573,12.921348,0.898876,13.820225
2,0,0,1,0,3.368253,0.9091,1.23,0.88,4.84,0,...,144.0,2.484907,0.336825,0.518193,0.280688,0.232293,9.756098,7.317073,4.471545,11.788618
3,0,0,1,0,3.874249,0.8501,1.58,0.88,4.92,1,...,1.0,0.0,0.77485,3.874249,3.874249,0.968562,0.632911,2.531646,0.0,2.531646
4,0,0,1,0,2.513931,0.8366,1.94,1.0,4.88,1,...,100.0,2.302585,0.157121,1.256965,0.251393,0.157121,5.154639,7.731959,0.515464,8.247423
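
Estimators fitted on plain arrays match features by position, so the submission columns should appear in the same order as the training columns. A minimal sketch of such a check, assuming both column lists are available; `check_feature_columns` is a hypothetical helper, not part of the notebook:

```python
def check_feature_columns(expected, actual):
    """Raise ValueError if `actual` differs from `expected` in content or order."""
    expected, actual = list(expected), list(actual)
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}, unexpected columns: {extra}")
    if expected != actual:
        raise ValueError("columns match but are in a different order")


# Toy example using a few of the column names shown above
train_columns = ["on_demand", "is_Friday", "distance"]
check_feature_columns(train_columns, ["on_demand", "is_Friday", "distance"])  # passes silently
```

In the notebook, this could be called with the feature columns of data.csv (without total_minutes) and `submission_data.columns` before predicting.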


In [4]:
# Predict with the three models and average their predictions
predictions = (
    svm_grid_search_best_model.predict(submission_data)
    + rfr_grid_search_best_model.predict(submission_data)
    + xgb_grid_search_best_model.predict(submission_data)
) / 3

# Write the predictions to a CSV file, indexed by order_id
response = pd.DataFrame({"predicted_total_minutes": predictions}, index=order_ids)
response.to_csv("submission_file.csv")
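
The ensemble here is an unweighted average: each order gets the mean of the three models' predictions. A small sketch of the same idea for any number of models, using stand-in models rather than the fitted ones; `average_predictions` and `ConstantModel` are illustrative names, not part of the notebook:

```python
from statistics import fmean


def average_predictions(models, X):
    """Return, for each row of X, the mean of all models' predictions."""
    per_model = [model.predict(X) for model in models]
    # zip(*per_model) groups the i-th prediction of every model together
    return [fmean(row) for row in zip(*per_model)]


class ConstantModel:
    """Stand-in for a fitted regressor; always predicts the same value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value] * len(X)


toy_models = [ConstantModel(10.0), ConstantModel(20.0), ConstantModel(30.0)]
toy_X = [[0.0, 0.0]] * 4  # four rows, two dummy features
toy_predictions = average_predictions(toy_models, toy_X)
print(toy_predictions)  # each entry is (10 + 20 + 30) / 3
```

With NumPy arrays, as returned by the fitted models, `np.mean(per_model, axis=0)` computes the same average in one call.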


### Additional metrics

In [3]:
from sklearn.model_selection import train_test_split

data = pd.read_csv("data.csv", index_col="order_id")

y = data.pop("total_minutes").to_numpy()
X = data.to_numpy()

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape


((3911, 37), (3912, 37), (3911,), (3912,))
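
The uneven split sizes follow from the odd number of rows. A short worked check, assuming scikit-learn rounds the test share up when `test_size` is a float and `train_size` is not set (which matches the shapes printed above):

```python
import math

n_samples = 3911 + 3912  # total rows, from the shapes printed above
test_size = 0.5

n_test = math.ceil(test_size * n_samples)  # the test share is rounded up
n_train = n_samples - n_test  # the remainder goes to training
print(n_train, n_test)  # 3911 3912
```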

In [8]:
from sklearn.metrics import r2_score, mean_squared_error

svm_random_search_best_model = load("rand_search_results/svm_best_model")
rfr_random_search_best_model = load("rand_search_results/rfr_best_model")
xgb_random_search_best_model = load("rand_search_results/xgb_best_model")


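# RMSE Grid search
# squared=False makes mean_squared_error return the RMSE instead of the MSE.
# Scores are printed in the order SVM, Random Forest, XGBoost, ensemble average.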
print("RMSE Grid search")
print(
    mean_squared_error(
        y_true=y_test, y_pred=svm_grid_search_best_model.predict(x_test), squared=False
    )
)
print(
    mean_squared_error(
        y_true=y_test, y_pred=rfr_grid_search_best_model.predict(x_test), squared=False
    )
)
print(
    mean_squared_error(
        y_true=y_test, y_pred=xgb_grid_search_best_model.predict(x_test), squared=False
    )
)
print(
    mean_squared_error(
        y_true=y_test,
        y_pred=(
            svm_grid_search_best_model.predict(x_test)
            + rfr_grid_search_best_model.predict(x_test)
            + xgb_grid_search_best_model.predict(x_test)
        )
        / 3,
        squared=False,
    )
)

# RMSE Random search
print("RMSE Random search")
print(
    mean_squared_error(
        y_true=y_test,
        y_pred=svm_random_search_best_model.predict(x_test),
        squared=False,
    )
)
print(
    mean_squared_error(
        y_true=y_test,
        y_pred=rfr_random_search_best_model.predict(x_test),
        squared=False,
    )
)
print(
    mean_squared_error(
        y_true=y_test,
        y_pred=xgb_random_search_best_model.predict(x_test),
        squared=False,
    )
)
print(
    mean_squared_error(
        y_true=y_test,
        y_pred=(
            svm_random_search_best_model.predict(x_test)
            + rfr_random_search_best_model.predict(x_test)
            + xgb_random_search_best_model.predict(x_test)
        )
        / 3,
        squared=False,
    )
)

# R2 Grid search
print("R2 Grid search")
print(r2_score(y_true=y_test, y_pred=svm_grid_search_best_model.predict(x_test)))
print(r2_score(y_true=y_test, y_pred=rfr_grid_search_best_model.predict(x_test)))
print(r2_score(y_true=y_test, y_pred=xgb_grid_search_best_model.predict(x_test)))
print(
    r2_score(
        y_true=y_test,
        y_pred=(
            svm_grid_search_best_model.predict(x_test)
            + rfr_grid_search_best_model.predict(x_test)
            + xgb_grid_search_best_model.predict(x_test)
        )
        / 3,
    )
)

# R2 Random search
print("R2 Random search")
print(r2_score(y_true=y_test, y_pred=svm_random_search_best_model.predict(x_test)))
print(r2_score(y_true=y_test, y_pred=rfr_random_search_best_model.predict(x_test)))
print(r2_score(y_true=y_test, y_pred=xgb_random_search_best_model.predict(x_test)))
print(
    r2_score(
        y_true=y_test,
        y_pred=(
            svm_random_search_best_model.predict(x_test)
            + rfr_random_search_best_model.predict(x_test)
            + xgb_random_search_best_model.predict(x_test)
        )
        / 3,
    )
)


RMSE Grid search
24.069331724085696
23.70938277809139
23.83119306960822
23.51559060539691
RMSE Random search
24.070202409927703
23.70480414297525
23.945040855966063
23.539715155482487
R2 Grid search
0.4403280311815706
0.4569422856493066
0.4513478805454606
0.46578353103869063
R2 Random search
0.44028753921621056
0.45715201047071974
0.44609325232265196
0.4646868677301107
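
For reference, the two metrics printed above can be written out by hand. The sketch below uses toy numbers rather than the notebook's data, and the function names `rmse` and `r2` are illustrative. The last lines check that the grid search ensemble RMSE printed above agrees with the square root of its MSE in the summary table.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy data, not taken from the notebook
toy_true = [10.0, 20.0, 30.0]
toy_pred = [12.0, 18.0, 33.0]
print(rmse(toy_true, toy_pred))
print(r2(toy_true, toy_pred))

# Consistency check: RMSE should equal sqrt(MSE) up to the table's precision
table_mse = 552.983  # grid search ensemble MSE from the summary table
printed_rmse = 23.51559060539691  # grid search ensemble RMSE printed above
print(math.sqrt(table_mse), printed_rmse)
```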
