# Model Training (NGBoost)

If you have your own ground truth energy data, you can train a custom RouteE powertrain model.

Make sure you have installed the optional dependencies, which a default pip install does not include.

In this example, we'll use the NGBoost trainer and estimator, which you can install with:

```bash
pip install ngboost
```

In [1]:
import nrel.routee.powertrain as pt

from nrel.routee.powertrain.trainers.ngboost_trainer import NGBoostTrainer


import pandas as pd



For demonstration purposes, we'll use a very small set of training data.
You can access this dataset yourself [here](https://github.com/NREL/routee-powertrain/blob/main/tests/routee-powertrain-test-data/sample_train_data.csv).

In [3]:
# Load the data
df = pd.read_csv("../tests/routee-powertrain-test-data/sample_train_data.csv")
df.rename(columns={'gallons_fastsim': 'gge'}, inplace=True)
df = df.dropna()
df.head()

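| Unnamed: 0 | speed_mph | grade_dec | miles | gge | trip_id | road_class |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 7.632068 | -0.008963 | 0.015469 | 0.000813 | 1 | 3 |
| 1 | 6.329613 | -0.047001 | 0.003516 | 0.000149 | 1 | 3 |
| 2 | 12.248512 | 0.0 | 0.003402 | 7.4e-05 | 1 | 4 |
| 3 | 23.752604 | -0.000463 | 0.019768 | 0.002194 | 1 | 1 |
| 4 | 46.024926 | -0.004641 | 0.038378 | 0.00097 | 1 | 0 |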


This dataframe represents a set of road network links (i.e. roads) for which we've already computed the energy consumption. In this case, we've used the Fastsim software to simulate a vehicle driving over a high resolution drive cycle and then aggregated everything up to the link level. We also have link level attributes like average driving speed in miles per hour (`speed_mph`), road gradient as a decimal (`grade_dec`), road distance in miles (`miles`) and road classification as an integer category (`road_class`). Lastly, we have a trip identifier column (`trip_id`), which is only 1 in this case, representing a single trip taken by this vehicle.

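The aggregation step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the per-second samples, the `aggregate_to_links` helper, and the one-second time step are hypothetical and are not part of Fastsim or RouteE Powertrain.

```python
from collections import defaultdict

# Hypothetical per-second samples: (link_id, speed_mph, gallons used in that second)
samples = [
    ("a", 10.0, 0.00002),
    ("a", 14.0, 0.00003),
    ("b", 30.0, 0.00005),
    ("b", 34.0, 0.00006),
    ("b", 32.0, 0.00005),
]


def aggregate_to_links(samples, dt_seconds=1.0):
    """Sum distance and energy per link, then derive the average speed."""
    totals = defaultdict(lambda: {"miles": 0.0, "gge": 0.0, "seconds": 0.0})
    for link_id, speed_mph, gallons in samples:
        link = totals[link_id]
        link["miles"] += speed_mph * dt_seconds / 3600.0  # mph times hours
        link["gge"] += gallons
        link["seconds"] += dt_seconds

    rows = {}
    for link_id, link in totals.items():
        hours = link["seconds"] / 3600.0
        rows[link_id] = {
            "speed_mph": link["miles"] / hours,  # average speed over the link
            "miles": link["miles"],
            "gge": link["gge"],
        }
    return rows


links = aggregate_to_links(samples)
```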
Ok, onto setting up the training pipeline.

First, we need to tell the trainer what feature sets we want to use for the internal estimators (NGBoost in this case). We can provide one or many feature sets, depending on the different features we might expect to see when applying this model. In this case, we'll use three different feature sets: one with just `speed_mph`, one with `speed_mph` and `grade_dec`, and another with `speed_mph`, `grade_dec`, and `road_class`. This makes our model flexible to cases where we might only have speed information for a link or where we might have more feature resolution.

In [4]:
feature_set_1 = [pt.DataColumn(name="speed_mph", units="mph")]
feature_set_2 = [
    pt.DataColumn(name="speed_mph", units="mph"),
    pt.DataColumn(name="grade_dec", units="decimal")
]
feature_set_3 = [
    pt.DataColumn(name="speed_mph", units="mph"),
    pt.DataColumn(name="grade_dec", units="decimal"),
    pt.DataColumn(name="road_class", units="category")
]
features = [
    feature_set_1,
    feature_set_2,
    feature_set_3
]

Note that we didn't include the distance column in any of our feature sets. That is because RouteE Powertrain always requires distance information, so distance has a special designation in the training configuration, whereas features can be any arbitrary link attribute. So, let's define our distance column:

In [5]:
distance = pt.DataColumn(name="miles", units="miles")

Now, we need to define our energy target, which is gallons of gasoline simulated by Fastsim:

In [6]:
energy_target = pt.DataColumn(
    name="gge", 
    units="gallons_gasoline", 
)

We also need to decide how we want to predict the energy.
We have two options: "rate" or "raw".
"rate" will take our energy values and divide them by the distance column to arrive at an energy rate.
Then, the estimator will be trained to predict the rate value (without using distance as a feature), and the model will multiply the rate value by the incoming link distance to give a final raw energy value.
This can be useful if your training data is sparse, as it allows the model to be flexible to distance.
"raw" will tell the estimator to predict the energy on the link directly, using distance as an explicit feature.
This can be more robust for situations where the energy rate on a link might vary with respect to distance, but it can lead to unreliable results if the training dataset does not contain a good representation of different distance values.
In our case we'll use "rate" since our training data is very sparse.

In [7]:
predict_method = "rate"

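The arithmetic behind the "rate" method can be sketched as follows. This is not the library's internal code; the `predict_with_rate` helper is hypothetical, and the (miles, gge) pairs are copied from the data preview above.

```python
# (miles, gge) pairs copied from the first rows of the sample data
links = [
    (0.015469, 0.000813),
    (0.003516, 0.000149),
    (0.019768, 0.002194),
]

# "rate": the training target becomes energy per unit distance (gallons per mile)
rates = [gge / miles for miles, gge in links]


def predict_with_rate(predicted_rate, link_miles):
    """Convert a predicted rate (gallons per mile) back to raw energy (gallons)."""
    return predicted_rate * link_miles


# Multiplying each rate by its own link distance recovers the raw energy,
# which is the final step a "rate" model applies to an incoming link
recovered = [predict_with_rate(rate, miles) for rate, (miles, _) in zip(rates, links)]
```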
Finally, we can build a model configuration that we can pass to the trainer. This also includes things like the vehicle powertrain type and a vehicle description.

In [8]:
config = pt.ModelConfig(
    vehicle_description="Test Vehicle",
    powertrain_type=pt.PowertrainType.ICE,
    feature_sets=features,
    distance=distance,
    target=energy_target,
    test_size=0.2,
    predict_method=predict_method
)

Now we build the NGBoost trainer and give it the desired parameters:

In [9]:
trainer = NGBoostTrainer(n_estimators=100)

All trainers have a `train` method that returns a trained vehicle model:

In [10]:
test_vehicle = trainer.train(df, config)

[iter 0] loss=-1.8281 val_loss=0.0000 scale=2.0000 norm=0.9229
[iter 20] loss=-2.0561 val_loss=0.0000 scale=2.0000 norm=0.6237
[iter 40] loss=-2.1728 val_loss=0.0000 scale=2.0000 norm=0.6440
[iter 60] loss=-2.2729 val_loss=0.0000 scale=2.0000 norm=0.6562
[iter 80] loss=-2.3627 val_loss=0.0000 scale=2.0000 norm=0.6721
[iter 0] loss=-1.8281 val_loss=0.0000 scale=1.0000 norm=0.4615
[iter 20] loss=-1.9869 val_loss=0.0000 scale=1.0000 norm=0.3494
[iter 40] loss=-2.1154 val_loss=0.0000 scale=2.0000 norm=0.6274
[iter 60] loss=-2.2324 val_loss=0.0000 scale=2.0000 norm=0.6587
[iter 80] loss=-2.3335 val_loss=0.0000 scale=2.0000 norm=0.6733
[iter 0] loss=-1.8281 val_loss=0.0000 scale=1.0000 norm=0.4615
[iter 20] loss=-1.9869 val_loss=0.0000 scale=1.0000 norm=0.3494
[iter 40] loss=-2.1156 val_loss=0.0000 scale=2.0000 norm=0.6275
[iter 60] loss=-2.2329 val_loss=0.0000 scale=2.0000 norm=0.6575
[iter 80] loss=-2.3344 val_loss=0.0000 scale=2.0000 norm=0.6724


With the model trained, we can inspect the errors for each estimator and energy target. (Note that we could have given multiple energy targets to the trainer, like gasoline and electricity for a plug-in hybrid vehicle.)

In [11]:
test_vehicle.errors


| 0 | 1 |
| --- | --- |
| Estimator Errors | Estimator Errors |
| Feature Set ID | speed_mph |
| Target | gge |
| Link RMSE | 0.00147 |
| Link Norm RMSE | 0.92759 |
| Link Weighted RPD | 0.76701 |
| Net Error | -0.29279 |
| Actual Dist/Energy | 18.87243 |
| Predicted Dist/Energy | 26.68559 |
| Real World Predicted Dist/Energy | 22.88644 |

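To give a feel for two of the metrics in the error table, here is a minimal sketch using common textbook definitions of RMSE and net error. These definitions are assumptions and may differ in detail from what RouteE Powertrain computes; the `rmse` and `net_error` helpers are hypothetical, and the predicted values are illustrative numbers rather than the held-out test set.

```python
import math

# Observed gge values from the data preview, paired with illustrative predictions
actual = [0.000813, 0.000149, 0.000074, 0.002194, 0.00097]
predicted = [0.000852, 0.000152, 0.000171, 0.001199, 0.001252]


def rmse(actual, predicted):
    """Root mean squared error across links."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def net_error(actual, predicted):
    """Relative difference between total predicted and total actual energy."""
    return (sum(predicted) - sum(actual)) / sum(actual)


link_rmse = rmse(actual, predicted)
trip_net_error = net_error(actual, predicted)
```

Under this definition, the table's own numbers appear consistent: over a fixed distance, the ratio of predicted to actual energy equals the ratio of actual to predicted distance per energy, and 18.87243 / 26.68559 - 1 is approximately -0.2928, which matches the reported Net Error.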

To use this model to predict results on a dataframe, use the following code:

In [12]:
result_df = test_vehicle.predict(df, ['grade_dec', 'speed_mph','road_class'], 'miles', True)
result_df.head()

| Unnamed: 0 | gge | gge_std |
| --- | --- | --- |
| 0 | 0.000852 | 0.000296 |
| 1 | 0.000152 | 6.5e-05 |
| 2 | 0.000171 | 0.000106 |
| 3 | 0.001199 | 0.000789 |
| 4 | 0.001252 | 0.000776 |

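One possible use of the `gge_std` column is to build an approximate interval around each prediction. The sketch below assumes that `gge_std` is the standard deviation of a Normal predictive distribution, which this walkthrough does not confirm; the `interval_95` helper is hypothetical.

```python
# (gge, gge_std) pairs copied from the prediction preview above
predictions = [
    (0.000852, 0.000296),
    (0.000152, 0.000065),
    (0.000171, 0.000106),
]

Z_95 = 1.96  # two-sided 95% quantile of the standard Normal distribution


def interval_95(mean, std):
    """Approximate 95% interval, assuming a Normal predictive distribution."""
    return mean - Z_95 * std, mean + Z_95 * std


intervals = [interval_95(mean, std) for mean, std in predictions]
```

Note that with a Normal assumption the lower bound can fall below zero for links with small predicted energy, so treat such intervals as rough guides only.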

While this training dataset is far too small to draw real conclusions, the error metrics shown earlier can give you an idea of how well the model performed on a holdout test set (20% of the training data, as we specified with the `test_size` parameter in the configuration).

Now we can write the model to a JSON file that can be loaded later:

In [13]:
test_vehicle.to_file("Test_Vehicle.json")

To load a saved model from a JSON file for further use, do the following:

In [None]:
test_vehicle = pt.Model.from_file("Test_Vehicle.json")

## RouteE Compass Integration

If you want to use this model with RouteE Compass, you can export any of the estimators as a binary file that can be loaded into RouteE Compass.

In this case, we have three estimators:

In [12]:
test_vehicle.estimators

{'speed_mph': <nrel.routee.powertrain.estimators.ngboost_estimator.NGBoostEstimator at 0x1597b3910>,
 'grade_dec&speed_mph': <nrel.routee.powertrain.estimators.ngboost_estimator.NGBoostEstimator at 0x159802b90>,
 'grade_dec&road_class&speed_mph': <nrel.routee.powertrain.estimators.ngboost_estimator.NGBoostEstimator at 0x159890850>}

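Judging from the output above, each estimator key looks like the feature names sorted alphabetically and joined with `&`. The helper below is a hypothetical sketch based on that observation, not a documented RouteE Powertrain function, and the dictionary is a stand-in for `test_vehicle.estimators`.

```python
def feature_set_id(feature_names):
    """Build a key like 'grade_dec&speed_mph' from a list of feature names."""
    return "&".join(sorted(feature_names))


# Stand-in for test_vehicle.estimators; the values here are just labels
estimators = {
    "speed_mph": "speed-only estimator",
    "grade_dec&speed_mph": "speed + grade estimator",
    "grade_dec&road_class&speed_mph": "speed + grade + road class estimator",
}

# The order in which the feature names are listed does not matter
key = feature_set_id(["speed_mph", "grade_dec"])
chosen = estimators[key]
```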
For this example, we'll take the estimator with speed and grade as features and export it to a binary file.

In [17]:
test_vehicle.estimators['grade_dec&speed_mph'].to_file("test_vehicle_speed_grade.bin")

Now we can load `test_vehicle_speed_grade.bin` into RouteE Compass.