# Adjusting GPS counts

Load the necessary libraries.

In [1]:
import os

import datablox_od
import pandas as pd

pd.set_option("display.max_rows", 5)

%load_ext autoreload
%autoreload 2

Define the folder names.

In [2]:
SAMPLE_DATA_DIRECTORY = os.path.join("..", "sample_data")
SCALING_DIRECTORY = os.path.join(SAMPLE_DATA_DIRECTORY, "scaling")

Load the dataset for training the GPS count adjustment model. This dataset should be a panel dataset (that is, there are observations for each entity/location across different time periods).

The columns of our training dataset are as follows:
- `ADM1_EN`: Location (entity, in panel dataset terminology)
- `month`: Month (time period, in panel dataset terminology)
- `ground_truth_moveout_count`: Ground-truth move-out count (obtained via the API service provided by Thailand's [Bureau of Registration Administration](https://www.bora.dopa.go.th/))
- `gps_moveout_count`: Estimated move-out count based on the GPS data
- `estimated_population`: Estimated population based on the GPS data
- `num_appid`: Number of unique application IDs in the GPS dataset for the month of interest
- `num_deviceid`: Number of unique device IDs in the GPS dataset for the month of interest

`ADM1_EN`, `gps_moveout_count`, `estimated_population`, `num_appid`, and `num_deviceid` comprise our feature set, and `ground_truth_moveout_count` is our target variable.
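Before preprocessing, it can help to confirm that a dataset really has the panel structure described above. The following is a minimal sketch on a toy DataFrame (the province names and values are illustrative, not taken from the training data); it checks that each entity/time pair appears at most once and that every entity covers the same number of time periods.

```python
import pandas as pd

# Toy panel with the same key columns as the training data (illustrative values only).
toy = pd.DataFrame(
    {
        "ADM1_EN": ["Province A", "Province A", "Province B", "Province B"],
        "month": ["Apr 2020", "Apr 2021", "Apr 2020", "Apr 2021"],
        "ground_truth_moveout_count": [752, 1163, 900, 1010],
        "gps_moveout_count": [2, 9, 5, 7],
    }
)

# Each (entity, time period) pair should appear at most once.
has_duplicate_keys = toy.duplicated(subset=["ADM1_EN", "month"]).any()

# Count the observed time periods per entity; in a balanced panel
# every entity has the same number of periods.
periods_per_entity = toy.groupby("ADM1_EN")["month"].nunique()
is_balanced = periods_per_entity.nunique() == 1

print("Duplicate entity/time pairs:", has_duplicate_keys)
print("Balanced panel:", is_balanced)
```

The same two checks can be run on the real dataset by swapping `toy` for the loaded DataFrame.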

In [3]:
dataset = pd.read_parquet(os.path.join(SCALING_DIRECTORY, "training_data.parquet"))
dataset

Unnamed: 0,ADM1_EN,month,ground_truth_moveout_count,gps_moveout_count,estimated_population,num_appid,num_deviceid
0,Amnat Charoen,Apr 2020,752,2,242,400,1802808
1,Amnat Charoen,Apr 2021,1163,9,1564,487279,2640291
...,...,...,...,...,...,...,...
5080,Yasothon,Sep 2023,1450,21,4706,230218,21113546
5081,Yasothon,Sep 2024,1073,28,5645,4039,11669267


The training dataset must be preprocessed with DataBlox-OD's `datablox_od.scaling.preprocess_dataset_for_gps_count_adjustment()`. Besides transforming the data, this function records the metadata that is needed later for training and for GPS count adjustment.

In [4]:
dataset = datablox_od.scaling.preprocess_dataset_for_gps_count_adjustment(
    dataset,
    entity_column="ADM1_EN",
    time_column="month",
    ground_truth_count_column="ground_truth_moveout_count",
    gps_count_column="gps_moveout_count",
    add_one_to_gps_counts=True,
    include_only_ground_truth_greater_than_gps_counts=True,
    log_normalize=True,
    add_zero_gps_count_zero_ground_truth=True,
)

print(dataset.metadata)
dataset.dataset

{'entity_column': 'ADM1_EN', 'time_column': 'month', 'ground_truth_count_column': 'ground_truth_moveout_count', 'gps_count_column': 'gps_moveout_count', 'add_one_to_gps_counts': True, 'log_normalize': True}


Unnamed: 0,ADM1_EN,month,ground_truth_moveout_count,gps_moveout_count,estimated_population,num_appid,num_deviceid
0,Amnat Charoen,Apr 2020,6.624065,1.386294,5.493061,5.993961,14.404857
1,Amnat Charoen,Apr 2021,7.059618,2.397895,7.355641,13.096594,14.786400
...,...,...,...,...,...,...,...
10162,Yasothon,Sep 2023,0.000000,0.000000,8.456806,12.346786,16.865425
10163,Yasothon,Sep 2024,0.000000,0.000000,8.638703,8.304000,16.272469
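The transformed values shown above are consistent with a `log(1 + x)` transform, with the GPS count incremented by one beforehand when `add_one_to_gps_counts=True`. This is an inference from the printed numbers, not a statement about the library's internals. The sketch below reproduces the first row of the preprocessed table from the first row of the raw table under that assumption.

```python
import numpy as np

# Raw values from the first row of the training data (Amnat Charoen, Apr 2020).
ground_truth = 752
gps_count = 2
estimated_population = 242

# Assumption: log normalization is log(1 + x).
log_ground_truth = np.log1p(ground_truth)
# Assumption: add_one_to_gps_counts=True increments the GPS count before the log.
log_gps_count = np.log1p(gps_count + 1)
log_population = np.log1p(estimated_population)

print(f"{log_ground_truth:.6f} {log_gps_count:.6f} {log_population:.6f}")
```

Note also that the preprocessed table has twice as many rows as the raw one (index up to 10163 versus 5081), and that the trailing rows have both counts set to zero, which matches what `add_zero_gps_count_zero_ground_truth=True` suggests.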


Initialize the GPS count adjustment model using DataBlox-OD's `datablox_od.scaling.build_model()`. 

In [5]:
model = datablox_od.scaling.build_model(
    gps_count_column="gps_moveout_count",
    random_state=42,
    model_type="xgboost",
    n_jobs=-1,
    max_depth=3,
    min_child_weight=5,
    eta=0.05,
    n_estimators=150,
    max_bin=256,
)

Train the model using DataBlox-OD's `datablox_od.scaling.train_model()`.

In [6]:
model = datablox_od.scaling.train_model(model, dataset)

Finally, we can obtain the adjusted counts using `datablox_od.scaling.adjust_gps_counts()`.

This function only needs the entity/location, the time/month, and the GPS count. The other features (namely, the number of unique application IDs and the number of unique device IDs) are automatically retrieved from the training dataset based on the specified month.

```{important}
When using ``datablox_od.scaling.adjust_gps_counts()``, the passed GPS counts are automatically preprocessed in the same way as the training dataset. This is why the training dataset must be preprocessed via ``datablox_od.scaling.preprocess_dataset_for_gps_count_adjustment()``: that function records metadata on the preprocessing steps.

To reiterate, ``datablox_od.scaling.adjust_gps_counts()`` expects a **raw GPS count**, without any form of preprocessing. Likewise, the adjusted count it returns should be taken as is. Apart from optionally rounding it to the nearest integer, no postprocessing is needed.
```
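To see why the function should receive a raw count, consider what a round trip through the preprocessing might look like. The sketch below is not DataBlox-OD code: `to_model_space` and `from_model_space` are hypothetical helpers, and the add-one and `log(1 + x)` steps are assumptions based on the `add_one_to_gps_counts` and `log_normalize` options used earlier.

```python
import numpy as np


def to_model_space(raw_gps_count: float) -> float:
    """Hypothetical forward transform: add one, then take log(1 + x)."""
    return float(np.log1p(raw_gps_count + 1))


def from_model_space(log_count: float) -> float:
    """Hypothetical inverse of log(1 + x), mapping a log-scale value back to a count."""
    return float(np.expm1(log_count))


raw = 500

# Transforming once gives the value a model trained on log-scale data would expect.
once = to_model_space(raw)

# Transforming an already-transformed value gives a very different input,
# which is the mistake that passing a preprocessed count would cause.
twice = to_model_space(once)

print("transformed once: ", once)
print("transformed twice:", twice)
print("inverse of once:  ", from_model_space(once))
```

Under these assumptions, preprocessing a count by hand before passing it in would apply the transform twice and feed the model a value far outside the range it was trained on.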

In [7]:
print("Place, time: GPS count -> adjusted count")
print("========================================")

provinces = ["Phuket", "Bangkok"]
time_periods = ["Jun 2024", "Dec 2024"]
gps_count = 500

for province in provinces:
    for time in time_periods:
        print(
            f"{province}, {time}: {gps_count} -> ",
            datablox_od.scaling.adjust_gps_counts(
                model, entity=province, time=time, gps_count=gps_count
            ),
        )

Place, time: GPS count -> adjusted count
========================================
Phuket, Jun 2024: 500 ->  2882.494579990721
Phuket, Dec 2024: 500 ->  2749.484316807332
Bangkok, Jun 2024: 500 ->  23711.196156458893
Bangkok, Dec 2024: 500 ->  22952.297476200405
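As noted earlier, the only postprocessing worth considering is rounding. The short sketch below rounds the adjusted counts copied from the output above and, purely for interpretation, reports how many times larger each adjusted count is than the raw GPS count of 500.

```python
# Adjusted counts copied from the output above; all correspond to a raw GPS count of 500.
gps_count = 500
adjusted = {
    ("Phuket", "Jun 2024"): 2882.494579990721,
    ("Phuket", "Dec 2024"): 2749.484316807332,
    ("Bangkok", "Jun 2024"): 23711.196156458893,
    ("Bangkok", "Dec 2024"): 22952.297476200405,
}

# Round each adjusted count and compute its ratio to the raw GPS count.
rounded = {key: round(value) for key, value in adjusted.items()}
ratios = {key: value / gps_count for key, value in adjusted.items()}

for (province, month), value in rounded.items():
    ratio = ratios[(province, month)]
    print(f"{province}, {month}: {value} (about {ratio:.1f}x the raw count)")
```

The same raw count maps to very different adjusted counts depending on the province and, to a lesser extent, the month, which is why both must be supplied along with the GPS count.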
