# Taxi trips ~ feature engineering

## Peeking at the data

Downloaded from https://www.kaggle.com/c/nyc-taxi-trip-duration

In [2]:
!ls ~/river_data/Taxis


train.csv


In [3]:
!head ~/river_data/Taxis/train.csv


id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
id0190469,2,2016-01-01 00:00:17,2016-01-01 00:14:26,5,-73.98174285888672,40.71915817260742,-73.93882751464845,40.82918167114258,N,849
id1665586,1,2016-01-01 00:00:53,2016-01-01 00:22:27,1,-73.98508453369139,40.74716567993164,-73.95803833007811,40.71749114990234,N,1294
id1210365,2,2016-01-01 00:01:01,2016-01-01 00:07:49,5,-73.9652786254883,40.80104064941406,-73.94747924804686,40.81517028808594,N,408
id3888279,1,2016-01-01 00:01:14,2016-01-01 00:05:54,1,-73.98229217529298,40.751331329345696,-73.99134063720702,40.75033950805664,N,280
id0924227,1,2016-01-01 00:01:20,2016-01-01 00:13:36,1,-73.97010803222656,40.75979995727539,-73.9893569946289,40.742988586425774,N,736
id2294362,2,2016-01-01 00:01:33,2016-01-01 00:13:25,1,-73.98499298095702,40.77389144897461,-73.93649291992188,40.84777069091797,N,712
id1078247,2,2016-01-01 00:01:37,

Our goal is to build a model that estimates the duration of each taxi trip duration. We'll focus on this for the next two courses. This part will be about feature engineering, while the next part will be about model building.

We'll do the feature engineering in SQL. This is an excuse to teach you some SQL. However, it's also interesting because it's a good example of how to do feature engineering in a database. This is a common situation in industry, where you have a lot of data in a database, and you want to do feature engineering in the database before you pull the data out for modeling.

We'll be using [DuckDB](https://duckdb.org/). It's a minimalist data warehouse that can run on a laptop. It's like SQLite, but it's column-oriented, which means it's suited for analytics.

In [5]:
import duckdb

with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE TABLE trips AS (
        SELECT *
        FROM read_csv_auto('~/river_data/Taxis/train.csv')
        WHERE 40.64150619506836 < pickup_latitude
        AND pickup_latitude < 40.84248242950439
        AND -74.01721954345702 < pickup_longitude
        AND pickup_longitude < -73.7766876220703
    )
    ''')


In [6]:
with duckdb.connect('taxi-trips.db') as db:
    db.table('trips').show()


┌───────────┬───────────┬─────────────────────┬───┬────────────────────┬────────────────────┬───────────────┐
│    id     │ vendor_id │   pickup_datetime   │ … │  dropoff_latitude  │ store_and_fwd_flag │ trip_duration │
│  varchar  │   int64   │      timestamp      │   │       double       │      varchar       │     int64     │
├───────────┼───────────┼─────────────────────┼───┼────────────────────┼────────────────────┼───────────────┤
│ id0190469 │         2 │ 2016-01-01 00:00:17 │ … │  40.82918167114258 │ N                  │           849 │
│ id1665586 │         1 │ 2016-01-01 00:00:53 │ … │  40.71749114990234 │ N                  │          1294 │
│ id1210365 │         2 │ 2016-01-01 00:01:01 │ … │  40.81517028808594 │ N                  │           408 │
│ id3888279 │         1 │ 2016-01-01 00:01:14 │ … │  40.75033950805664 │ N                  │           280 │
│ id0924227 │         1 │ 2016-01-01 00:01:20 │ … │ 40.742988586425774 │ N                  │           736 │
│ id229436

In [8]:
query = '''
SELECT
    MIN(pickup_datetime),
    MAX(pickup_datetime),
    COUNT(*),
    AVG(trip_duration),
    MAX(trip_duration),
    MIN(trip_duration),
    QUANTILE(trip_duration, 0.99)
FROM trips
'''
with duckdb.connect('taxi-trips.db') as db:
    job = db.execute(query)
    df = job.fetch_df()
df.loc[0]


min(pickup_datetime)             2016-01-01 00:00:17
max(pickup_datetime)             2016-06-30 23:59:39
count_star()                                 1453068
avg(trip_duration)                        957.844934
max(trip_duration)                           3526282
min(trip_duration)                                 1
quantile(trip_duration, 0.99)                   3423
Name: 0, dtype: object

Let's verify the data is balanced across months.

In [9]:
query = '''
SELECT
    EXTRACT(MONTH FROM pickup_datetime),
    COUNT(*)
FROM trips
GROUP BY 1
'''
with duckdb.connect('taxi-trips.db') as db:
    job = db.execute(query)
    df = job.fetch_df()
df


Unnamed: 0,"main.date_part('month', pickup_datetime)",count_star()
0,1,228849
1,2,237445
2,3,255235
3,4,250674
4,5,247405
5,6,233460


This is important for cross-validation purposes. It's reassuring to know the amount of taxi trips is roughly the same across months. There is a less chance to have a month over-represented in the training set and under-represented in the validation set.

## Define the target

We'll create a table to hold the targets: the duration of each trip.

In [11]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE TABLE targets AS (
        SELECT
            id,
            IF(trip_duration > 3600, 3600, trip_duration) AS trip_duration
        FROM trips
    )
    ''')
    targets = db.execute('SELECT * FROM targets').fetch_df()
targets.head()


Unnamed: 0,id,trip_duration
0,id0190469,849
1,id1665586,1294
2,id1210365,408
3,id3888279,280
4,id0924227,736


## Distance features

We're looking to predict a duration. The most obvious feature is the distance between the pickup and dropoff locations. We'll use the Euclidean distance, which is the distance as the crow flies. We'll also add the Manhattan distance, which is the distance between two points if you can only travel on a rectangular grid.

In [8]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW distances AS (
        SELECT
            id,
            (
                ABS(pickup_longitude - dropoff_longitude) +
                ABS(pickup_latitude - dropoff_latitude)
            ) AS l1_distance,
            POW(
                POW(pickup_longitude - dropoff_longitude, 2) +
                POW(pickup_latitude - dropoff_latitude, 2),
                0.5
            ) AS l2_distance
        FROM trips
    )
    ''')
    distances = db.execute('SELECT * FROM distances').fetch_df()
distances.head()


Unnamed: 0,id,l1_distance,l2_distance
0,id0190469,0.152939,0.118097
1,id1665586,0.056721,0.040151
2,id1210365,0.031929,0.022726
3,id3888279,0.01004,0.009103
4,id0924227,0.03606,0.025557


## Basic model to evaluate feature importance

The goal of feature engineering is, well, to produce good features. Good features are those that improve the predictive performance of the model. Some models provide the importance of each feature. Therefore, when we're doing feature engineering, we can iterate on the features, we can iterate and measure the performance of each change. It is thus useful to have a baseline model to compare the performance of the features we're engineering.

The way we'll do this is here is that we'll join the features with the targets based on a shared `id` column. This works in our case because we've calculated the feature for each row in the dataset. We'll keep this pattern going forward.

In [9]:
with duckdb.connect('taxi-trips.db') as db:
    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        ''')
        .fetch_df()
    )
dataset.head()


Unnamed: 0,id,pickup_datetime,trip_duration,l1_distance,l2_distance
0,id0190469,2016-01-01 00:00:17,849,0.152939,0.118097
1,id1665586,2016-01-01 00:00:53,1294,0.056721,0.040151
2,id1210365,2016-01-01 00:01:01,408,0.031929,0.022726
3,id3888279,2016-01-01 00:01:14,280,0.01004,0.009103
4,id0924227,2016-01-01 00:01:20,736,0.03606,0.025557


Our baseline model will be a LightGBM regressor. A nice aspect of tree models is that they handle feature interactions, so we don't need to engineer them ourselves.

In [10]:
import datetime as dt
import lightgbm as lgb
from sklearn import model_selection

def train_and_test(dataset):

    cv = model_selection.TimeSeriesSplit(n_splits=5)
    dataset = dataset.sort_values('pickup_datetime')
    X = dataset.drop(columns=['id', 'trip_duration', 'pickup_datetime'])
    y = dataset['trip_duration']

    for col in X.columns[X.dtypes == 'object']:
        X[col] = X[col].astype('category')

    model = lgb.LGBMRegressor(
        n_estimators=30,
        max_depth=5,
        random_state=42
    )

    scores = model_selection.cross_val_score(
        model, X, y, scoring='neg_mean_absolute_error', cv=cv,
    )
    mae = -scores.mean()
    print(f'[MAE] {dt.timedelta(seconds=mae)}')

    model.fit(X, y)
    feature_importances = sorted(
        zip(X.columns, model.feature_importances_),
        key=lambda x: x[1],
        reverse=True
    )
    for feauture_name, importance in feature_importances:
        print(f'{feauture_name}: {importance}')


In [11]:
train_and_test(dataset)


[MAE] 0:04:22.757189
l2_distance: 543
l1_distance: 357


This is our baseline. Our objective is to improve on this baseline. It's a good habit to write this baseline down and update it as we improve our model. It's a bit like writing down a progress log of an expedition.

```
[MAE] 0:04:22.751558
l2_distance: 553
l1_distance: 347
```

Of course, it matters how we split the data between training and validation. We'll use an arbitrary tempoeral split for now. This makes sense if, say, we're looking to build a good model for the next month. However, if we're looking to build a model that generalizes well to any month, we should use a random split.

## Time features

In [55]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW time_features AS (
        SELECT
            id,
            EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
            EXTRACT(WEEKDAY FROM pickup_datetime) AS pickup_weekday,
        FROM trips
    )
    ''')

    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        ''')
        .fetch_df()
    )

train_and_test(dataset)


[MAE] 0:03:56.074231
pickup_hour: 424
l2_distance: 227
pickup_weekday: 209
l1_distance: 40


What's interesting here is that took us very little code to evaluate the impact of the new features. Generally speaking, adding simple features always improves the performance of a gradient boosting tree model. This isn't always the case for temporal features.

Before we go into that, let's see if changing the type of these features helps. Indeed, our model treats the hour (and weekday) as numeric features. So the hour feature varies from 0 to 23. But that's not ideal, because the behavior pattern of taxis is simular at 1AM as it is at 11PM. We can fix this by treating the hour as a categorical feature. Gradient boosted tree models can handle categorical features natively. Indeed, a tree model can make a split in its structure by selecting several categories at once.

In [None]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW time_features AS (
        SELECT
            id,
            CAST(EXTRACT(HOUR FROM pickup_datetime) AS STRING) AS hour,
            CAST(EXTRACT(WEEKDAY FROM pickup_datetime) AS STRING) AS weekday,
        FROM trips
    )
    ''')

    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        ''')
        .fetch_df()
    )

train_and_test(dataset)


[MAE] 0:03:51.947366
hour: 353
l2_distance: 264
weekday: 228
l1_distance: 55


That's a bit better, which makes sense. Note that there's something called [cyclic encoding](https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html) to deal with time features. We won't go into that here.

## Average duration per day of week and hour

In practice, the best features are those that exploit the target. That makes sense because what we're attempting to do is, well, predict the target. Using features that have nothing to do with the target can help, but it's not a guarantee.

There is some danger to using the target. We're not supposed to know during inference time. What works locally during cross-validation may not work once the model is deployed. It's crucial to be aware of this.

In [13]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW time_aggs AS (

        WITH per_hour AS (
            SELECT
                EXTRACT(HOUR FROM pickup_datetime) AS hour,
                AVG(trip_duration) AS avg_duration_per_hour
            FROM trips
            GROUP BY 1
        ),

        per_weekday AS (
            SELECT
                EXTRACT(WEEKDAY FROM pickup_datetime) AS weekday,
                AVG(trip_duration) AS avg_duration_per_weekday
            FROM trips
            GROUP BY 1
        ),

        overall AS (
            SELECT
                AVG(trip_duration) AS avg_duration
            FROM trips
        )

        SELECT
            trips.id,
            per_hour.avg_duration_per_hour,
            per_weekday.avg_duration_per_weekday,
            overall.avg_duration
        FROM trips
        LEFT JOIN per_hour ON
            EXTRACT(HOUR FROM pickup_datetime) = per_hour.hour
        LEFT JOIN per_weekday ON
            EXTRACT(WEEKDAY FROM pickup_datetime) = per_weekday.weekday
        LEFT JOIN overall ON TRUE

    )
    ''')
    time_aggs = db.execute('SELECT * FROM time_aggs').fetch_df()

time_aggs.head()


Unnamed: 0,id,avg_duration_per_hour,avg_duration_per_weekday,avg_duration
0,id0190469,936.750604,988.769519,957.844934
1,id1665586,936.750604,988.769519,957.844934
2,id1210365,936.750604,988.769519,957.844934
3,id3888279,936.750604,988.769519,957.844934
4,id0924227,936.750604,988.769519,957.844934


Let's join these with the distance and time features we've engineered so far.

In [14]:
with duckdb.connect('taxi-trips.db') as db:
    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id),
            time_aggs.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        LEFT JOIN time_aggs USING (id)

        ''')
        .fetch_df()
    )
dataset.head()


Unnamed: 0,id,pickup_datetime,trip_duration,l1_distance,l2_distance,hour,weekday,avg_duration_per_hour,avg_duration_per_weekday,avg_duration
0,id0190469,2016-01-01 00:00:17,849,0.152939,0.118097,0,5,936.750604,988.769519,957.844934
1,id1665586,2016-01-01 00:00:53,1294,0.056721,0.040151,0,5,936.750604,988.769519,957.844934
2,id1210365,2016-01-01 00:01:01,408,0.031929,0.022726,0,5,936.750604,988.769519,957.844934
3,id3888279,2016-01-01 00:01:14,280,0.01004,0.009103,0,5,936.750604,988.769519,957.844934
4,id0924227,2016-01-01 00:01:20,736,0.03606,0.025557,0,5,936.750604,988.769519,957.844934


In [15]:
train_and_test(dataset)


[MAE] 0:03:52.008939
hour: 336
l2_distance: 272
weekday: 144
avg_duration_per_weekday: 73
l1_distance: 50
avg_duration_per_hour: 25
avg_duration: 0


These features don't bring much because they're convey the same information as the day of week and hour features. It's true: knowing the hour of the day is equivalent to knowing the average trip duration per hour of the day. There's no big relevant information gain in what we just did.

Before we go further, there's a big issue to spot in what we just: we're leaking the target variable into the features. We're using the average duration per day of week and hour to predict the duration. But the duration is used to compute the average duration per day of week and hour. That's cheating. We need to fix that.

## Rolling average duration per day of week and hour

Let's look at how to do rolling features. The idea is that we don't want to include all rows when calculating the average duration at a given point in time. For forecasting, we indeed should only be aware of what happened before the point in time we're trying to predict. This is what window calculations allow doing.

In [16]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW rolling_time_aggs AS (

        SELECT
            trips.id,

            AVG(trip_duration) OVER (
                PARTITION BY EXTRACT(HOUR FROM pickup_datetime)
                ORDER BY pickup_datetime
                ROWS BETWEEN UNBOUNDED PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration_per_hour,

            AVG(trip_duration) OVER (
                PARTITION BY EXTRACT(WEEKDAY FROM pickup_datetime)
                ORDER BY pickup_datetime
                ROWS BETWEEN UNBOUNDED PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration_per_weekday,

            AVG(trip_duration) OVER (
                ORDER BY pickup_datetime
                ROWS BETWEEN UNBOUNDED PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration

        FROM trips

    )
    ''')
    rolling_time_aggs = db.execute('SELECT * FROM rolling_time_aggs').fetch_df()

rolling_time_aggs.head()


Unnamed: 0,id,avg_duration_per_hour,avg_duration_per_weekday,avg_duration
0,id0190469,,,
1,id1665586,849.0,849.0,849.0
2,id1210365,1071.5,1071.5,1071.5
3,id3888279,850.333333,850.333333,850.333333
4,id0924227,707.75,707.75,707.75


In [17]:
with duckdb.connect('taxi-trips.db') as db:
    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        LEFT JOIN rolling_time_aggs USING (id)
        ''')
        .fetch_df()
    )

train_and_test(dataset)


[MAE] 0:03:52.068847
hour: 353
l2_distance: 269
weekday: 225
l1_distance: 53


The performance got slightly worse, because it's more realistic now. We're not cheating anymore: we're only using information that happens before each row. Moreover, the windows are the large: what matters is what the average duration was in the past few hours, not the whole past history.

The nice thing is that we now have a practical way to manipulate windows, using `OVER`.

## Looking at recent windows

In [18]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW recent_time_aggs AS (

        SELECT
            trips.id,

            AVG(trip_duration) OVER (
                PARTITION BY EXTRACT(HOUR FROM pickup_datetime)
                ORDER BY pickup_datetime
                ROWS BETWEEN 100 PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration_per_hour_recent,

            AVG(trip_duration) OVER (
                PARTITION BY EXTRACT(WEEKDAY FROM pickup_datetime)
                ORDER BY pickup_datetime
                ROWS BETWEEN 100 PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration_per_weekday_recent,

            AVG(trip_duration) OVER (
                ORDER BY pickup_datetime
                ROWS BETWEEN 100 PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration_recent

        FROM trips

    )
    ''')

    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id),
            rolling_time_aggs.* EXCLUDE (id),
            recent_time_aggs.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        LEFT JOIN rolling_time_aggs USING (id)
        LEFT JOIN recent_time_aggs USING (id)
        ''')
        .fetch_df()
    )

train_and_test(dataset)


[MAE] 0:03:51.816796
hour: 292
l2_distance: 248
weekday: 121
avg_duration_recent: 64
avg_duration_per_weekday_recent: 52
avg_duration_per_hour_recent: 37
l1_distance: 33
avg_duration_per_weekday: 31
avg_duration: 18
avg_duration_per_hour: 4


That's still not a great performance. Why? Well, we're still looking at the average duration of trips overall, even it's limited in time. We are not distinguishing between the different trips. In other words, we're not looking at the average duration of trips that are similar to the trip we're trying to predict.

## Including geography

The averages we calculate are over the whole dataset. But we know that the duration depends on the pickup and dropoff locations. Let's include the geography in our averages.

In [19]:
with duckdb.connect('taxi-trips.db') as db:
    grid = db.execute('''
    CREATE OR REPLACE TABLE cells AS

        WITH min_max_coords AS (
            SELECT
                MIN(pickup_latitude) AS min_lat,
                MAX(pickup_latitude) AS max_lat,
                MIN(pickup_longitude) AS min_lon,
                MAX(pickup_longitude) AS max_lon
            FROM trips
        ),

        grid AS (
            SELECT x, y
            FROM (
                SELECT (ROW_NUMBER() OVER () - 1) AS x
                FROM main.range(1, 21)
            )
            CROSS JOIN (
                SELECT (ROW_NUMBER() OVER () - 1) AS y
                FROM main.range(1, 21)
            )
        )

        SELECT
            FORMAT('{}-{}', x, y) AS cell_id,
            min_lat + (y * lat_interval) AS cell_min_lat,
            min_lat + ((y + 1) * lat_interval) AS cell_max_lat,
            min_lon + (x * lon_interval) AS cell_min_lon,
            min_lon + ((x + 1) * lon_interval) AS cell_max_lon
        FROM grid
        CROSS JOIN (
            SELECT
                *,
                (max_lat - min_lat) / 20 AS lat_interval,
                (max_lon - min_lon) / 20 AS lon_interval
            FROM min_max_coords
        )
    ''')

    cells = db.execute('SELECT * FROM cells').fetch_df()
cells.head()


Unnamed: 0,cell_id,cell_min_lat,cell_max_lat,cell_min_lon,cell_max_lon
0,0-0,40.64151,40.651558,-74.017212,-74.005186
1,1-0,40.64151,40.651558,-74.005186,-73.99316
2,2-0,40.64151,40.651558,-73.99316,-73.981134
3,3-0,40.64151,40.651558,-73.981134,-73.969109
4,4-0,40.64151,40.651558,-73.969109,-73.957083


Let's visualize this grid.

In [20]:
import folium
import json

min_lat = cells['cell_min_lat'].min()
max_lat = cells['cell_max_lat'].max()
min_lon = cells['cell_min_lon'].min()
max_lon = cells['cell_max_lon'].max()
m = folium.Map(
    location=[(min_lat + max_lat) / 2, (min_lon + max_lon) / 2],
    zoom_start=11
)

folium.GeoJson(
    data={
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {
                    "name": f"Cell {cell['cell_id']}"
                },
                "geometry": {
                    "type": "Polygon",
                    "coordinates": [
                        [
                            [cell['cell_min_lon'], cell['cell_min_lat']],
                            [cell['cell_max_lon'], cell['cell_min_lat']],
                            [cell['cell_max_lon'], cell['cell_max_lat']],
                            [cell['cell_min_lon'], cell['cell_max_lat']],
                        ]
                    ]
                }
            }
            for cell in cells.to_dict(orient='records')
        ]
    },
    name="Grid"
).add_to(m)
m


We can first use these cells to encode the trip. This should complement the distance features we already have.

In [21]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW cell_pairs AS (
        SELECT
            trips.id,
            pickup_cells.cell_id AS pickup_cell_id,
            dropoff_cells.cell_id AS dropoff_cell_id,
            FORMAT('{}_{}', pickup_cells.cell_id, dropoff_cells.cell_id) AS trip_cell_pair
        FROM trips
        LEFT JOIN cells AS pickup_cells ON
            pickup_latitude BETWEEN pickup_cells.cell_min_lat AND pickup_cells.cell_max_lat
            AND pickup_longitude BETWEEN pickup_cells.cell_min_lon AND pickup_cells.cell_max_lon
        LEFT JOIN cells AS dropoff_cells ON
            dropoff_latitude BETWEEN dropoff_cells.cell_min_lat AND dropoff_cells.cell_max_lat
            AND dropoff_longitude BETWEEN dropoff_cells.cell_min_lon AND dropoff_cells.cell_max_lon
    )
    ''')
    cell_pairs = db.execute('SELECT * FROM cell_pairs').fetch_df()

cell_pairs.head()


Unnamed: 0,id,pickup_cell_id,dropoff_cell_id,trip_cell_pair
0,id1162556,2-11,1-19,2-11_1-19
1,id3831771,2-12,1-19,2-12_1-19
2,id2701898,2-11,1-18,2-11_1-18
3,id2141697,1-17,1-17,1-17_1-17
4,id2614733,0-16,0-16,0-16_0-16


In [27]:
with duckdb.connect('taxi-trips.db') as db:
    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id),
            recent_time_aggs.* EXCLUDE (id),
            cell_pairs.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        LEFT JOIN rolling_time_aggs USING (id)
        LEFT JOIN recent_time_aggs USING (id)
        LEFT JOIN cell_pairs USING (id)
        ''')
        .fetch_df()
    )

train_and_test(dataset)


[MAE] 0:03:36.817486
hour: 187
dropoff_cell_id: 171
l2_distance: 166
trip_cell_pair: 144
pickup_cell_id: 67
weekday: 60
avg_duration_per_weekday_recent: 40
avg_duration_recent: 34
avg_duration_per_hour_recent: 18
l1_distance: 13


That's a boost! It makes that knowing what the trip will be about is useful to predict the duration.

## Count-based encoding

Count-based encoding is an easy way to encode categorical variables. It's a bit like one-hot encoding, but instead of having a 1 for the category and 0 for the others, we have the number of times the category appears in the dataset. It's a bit like a frequency encoding, but it's more robust to outliers. It doesn't use the target, so it's not prone to overfitting.

In [29]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW cell_pair_counts AS (
        SELECT
            trips.id,
            pickup_cells.cell_id AS pickup_cell_id,
            dropoff_cells.cell_id AS dropoff_cell_id,
            COUNT(*) OVER (
                PARTITION BY pickup_cells.cell_id, dropoff_cells.cell_id
            ) AS cell_pair_count
        FROM trips
        LEFT JOIN cells AS pickup_cells ON
            pickup_latitude BETWEEN pickup_cells.cell_min_lat AND pickup_cells.cell_max_lat
            AND pickup_longitude BETWEEN pickup_cells.cell_min_lon AND pickup_cells.cell_max_lon
        LEFT JOIN cells AS dropoff_cells ON
            dropoff_latitude BETWEEN dropoff_cells.cell_min_lat AND dropoff_cells.cell_max_lat
            AND dropoff_longitude BETWEEN dropoff_cells.cell_min_lon AND dropoff_cells.cell_max_lon
    )
    ''')
    cell_pair_counts = db.execute('SELECT * FROM cell_pair_counts').fetch_df()

cell_pair_counts.head()


Unnamed: 0,id,pickup_cell_id,dropoff_cell_id,cell_pair_count
0,id1615460,0-0,2-2,2
1,id0284504,0-0,2-2,2
2,id0847078,0-0,3-4,1
3,id1658534,0-0,6-10,1
4,id2759103,0-1,4-7,1


In [34]:
(
    cell_pair_counts
    .groupby(['pickup_cell_id', 'dropoff_cell_id']).first()
    ['cell_pair_count']
    .sort_values(ascending=False)
    .head(10)
)


pickup_cell_id  dropoff_cell_id
3-11            2-10               11054
2-10            3-11               10205
2-11            3-11                8711
2-10            2-11                8132
2-11            2-10                8112
3-11            3-11                7559
2-10            2-10                7392
                2-9                 6715
3-11            2-11                6675
4-12            3-11                6625
Name: cell_pair_count, dtype: int64

In [36]:
pickup_counts = (
    cell_pair_counts
    .groupby(['pickup_cell_id', 'dropoff_cell_id']).first()
    .groupby('pickup_cell_id')['cell_pair_count'].sum()
)
pickup_counts.sample(10)


pickup_cell_id
15-5       23
8-14       11
9-5        26
14-7       93
10-9      123
2-14     1133
7-10     1299
7-6       298
14-11      11
6-16     2071
Name: cell_pair_count, dtype: int64

In [41]:
from branca.colormap import linear

min_lat = cells['cell_min_lat'].min()
max_lat = cells['cell_max_lat'].max()
min_lon = cells['cell_min_lon'].min()
max_lon = cells['cell_max_lon'].max()
m = folium.Map(
    location=[(min_lat + max_lat) / 2, (min_lon + max_lon) / 2],
    zoom_start=11
)

# Create a linear color map based on the 'cell_value' column
colormap = linear.YlOrRd_04.scale(
    pickup_counts.min(),
    pickup_counts.max()
)

def style_function(feature):
    cell_value = feature['properties']['cell_value']
    color = colormap(cell_value)
    return {
        'fillColor': color,
        'color': 'black',
        'weight': 2,
        'fillOpacity': 0.7
    }

folium.GeoJson(
    data={
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {
                    "name": f"Cell {cell['cell_id']}",
                    "cell_value": int(pickup_counts.get(cell['cell_id'], 0))
                },
                "geometry": {
                    "type": "Polygon",
                    "coordinates": [
                        [
                            [cell['cell_min_lon'], cell['cell_min_lat']],
                            [cell['cell_max_lon'], cell['cell_min_lat']],
                            [cell['cell_max_lon'], cell['cell_max_lat']],
                            [cell['cell_min_lon'], cell['cell_max_lat']],
                        ]
                    ]
                }
            }
            for cell in cells.to_dict(orient='records')
        ]
    },
    name="Grid",
    style_function=style_function
).add_to(m)
m


In [45]:
with duckdb.connect('taxi-trips.db') as db:
    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id),
            recent_time_aggs.* EXCLUDE (id),
            cell_pair_counts.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        LEFT JOIN rolling_time_aggs USING (id)
        LEFT JOIN recent_time_aggs USING (id)
        LEFT JOIN cell_pair_counts USING (id)
        ''')
        .fetch_df()
    )

train_and_test(dataset)


[MAE] 0:03:37.623455
dropoff_cell_id: 231
hour: 199
l2_distance: 179
pickup_cell_id: 112
weekday: 60
avg_duration_recent: 40
avg_duration_per_weekday_recent: 35
avg_duration_per_hour_recent: 19
cell_pair_count: 14
l1_distance: 11


## Geo based rolling averages

In [45]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW geo_aggs AS (
        SELECT
            id,
            pickup_cell_id,
            dropoff_cell_id,

            AVG(trip_duration) OVER (
                PARTITION BY pickup_cell_id, dropoff_cell_id
                ORDER BY pickup_datetime
                ROWS BETWEEN 1000 PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration_per_cell_pair,

            AVG(trip_duration) OVER (
                PARTITION BY EXTRACT(HOUR FROM pickup_datetime), pickup_cell_id, dropoff_cell_id
                ORDER BY pickup_datetime
                ROWS BETWEEN 1000 PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration_per_hour_per_cell_pair,

            AVG(trip_duration) OVER (
                PARTITION BY EXTRACT(WEEKDAY FROM pickup_datetime), pickup_cell_id, dropoff_cell_id
                ORDER BY pickup_datetime
                ROWS BETWEEN 1000 PRECEDING
                AND 1 PRECEDING
            ) AS avg_duration_per_weekday_per_cell_pair,

        FROM (
            SELECT
                trips.*,
                pickup_cells.cell_id AS pickup_cell_id,
                dropoff_cells.cell_id AS dropoff_cell_id
            FROM trips
            LEFT JOIN cells AS pickup_cells ON
                pickup_latitude BETWEEN pickup_cells.cell_min_lat AND pickup_cells.cell_max_lat
                AND pickup_longitude BETWEEN pickup_cells.cell_min_lon AND pickup_cells.cell_max_lon
            LEFT JOIN cells AS dropoff_cells ON
                dropoff_latitude BETWEEN dropoff_cells.cell_min_lat AND dropoff_cells.cell_max_lat
                AND dropoff_longitude BETWEEN dropoff_cells.cell_min_lon AND dropoff_cells.cell_max_lon
        )
    )
    ''')
    geo_aggs = db.execute('SELECT * FROM geo_aggs').fetch_df()

geo_aggs.head()


Unnamed: 0,id,pickup_cell_id,dropoff_cell_id,avg_duration_per_cell_pair,avg_duration_per_hour_per_cell_pair,avg_duration_per_weekday_per_cell_pair
0,id1615460,0-0,2-2,368.0,,
1,id0847078,0-0,3-4,,,
2,id3967876,0-10,1-7,922.0,922.0,
3,id3073001,0-10,1-7,3865.5,874.666667,85330.0
4,id3819150,0-10,1-7,3411.454545,521.0,43521.0


In [48]:
with duckdb.connect('taxi-trips.db') as db:
    dataset = (
        db.execute('''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id),
            recent_time_aggs.* EXCLUDE (id),
            cell_pair_counts.* EXCLUDE (id),
            geo_aggs.* EXCLUDE (id)
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        LEFT JOIN rolling_time_aggs USING (id)
        LEFT JOIN recent_time_aggs USING (id)
        LEFT JOIN cell_pair_counts USING (id)
        LEFT JOIN geo_aggs USING (id)
        ''')
        .fetch_df()
    )

train_and_test(dataset)


[MAE] 0:03:33.325916
hour: 183
l2_distance: 143
dropoff_cell_id: 129
avg_duration_per_hour_per_cell_pair: 113
weekday: 94
avg_duration_recent: 53
pickup_cell_id: 48
avg_duration_per_weekday_per_cell_pair: 47
avg_duration_per_weekday_recent: 37
avg_duration_per_cell_pair: 23
l1_distance: 15
avg_duration_per_hour_recent: 13
cell_pair_count: 2
pickup_cell_id_2: 0
dropoff_cell_id_2: 0


## Bayesian averaging

One issue with averages is that they hide the magnitude of the denominator. For instance, if we have a single trip in a cell pair that lasted 10 hours, the average duration will be 10 hours, even if all other trips lasted 10 minutes on average. This is a problem because we want to predict the duration of each trip, not the average duration of all trips.

Even if we take a popular cell pair, the average duration will be biased by the number of trips at first:

In [49]:
(
    dataset.query('pickup_cell_id == "3-11" and dropoff_cell_id == "2-10"')
    [['pickup_datetime', 'trip_duration', 'avg_duration_per_cell_pair']]
    .plot(x='pickup_datetime', y=['avg_duration_per_cell_pair'])
)


One solution is to do [Bayesian averaging](https://www.wikiwand.com/en/Bayesian_average). See also this article on how this is used at [IMDb](https://www.fxsolver.com/browse/formulas/Bayes+estimator+-+Internet+Movie+Database+%28IMDB%29).

In [50]:
with duckdb.connect('taxi-trips.db') as db:
    db.execute('''
    CREATE OR REPLACE VIEW geo_counts AS (
        SELECT
            id,
            pickup_cell_id,
            dropoff_cell_id,

            COUNT(*) OVER (
                PARTITION BY pickup_cell_id, dropoff_cell_id
                ORDER BY pickup_datetime
                ROWS BETWEEN 1000 PRECEDING
                AND 1 PRECEDING
            ) AS count_per_cell_pair,

            COUNT(*) OVER (
                PARTITION BY EXTRACT(HOUR FROM pickup_datetime), pickup_cell_id, dropoff_cell_id
                ORDER BY pickup_datetime
                ROWS BETWEEN 1000 PRECEDING
                AND 1 PRECEDING
            ) AS count_per_hour_per_cell_pair,

            COUNT(*) OVER (
                PARTITION BY EXTRACT(WEEKDAY FROM pickup_datetime), pickup_cell_id, dropoff_cell_id
                ORDER BY pickup_datetime
                ROWS BETWEEN 1000 PRECEDING
                AND 1 PRECEDING
            ) AS count_per_weekday_per_cell_pair,

        FROM (
            SELECT
                trips.*,
                pickup_cells.cell_id AS pickup_cell_id,
                dropoff_cells.cell_id AS dropoff_cell_id
            FROM trips
            LEFT JOIN cells AS pickup_cells ON
                pickup_latitude BETWEEN pickup_cells.cell_min_lat AND pickup_cells.cell_max_lat
                AND pickup_longitude BETWEEN pickup_cells.cell_min_lon AND pickup_cells.cell_max_lon
            LEFT JOIN cells AS dropoff_cells ON
                dropoff_latitude BETWEEN dropoff_cells.cell_min_lat AND dropoff_cells.cell_max_lat
                AND dropoff_longitude BETWEEN dropoff_cells.cell_min_lon AND dropoff_cells.cell_max_lon
        )
    )
    ''')
    geo_counts = db.execute('SELECT * FROM geo_counts').fetch_df()

geo_counts.head()


Unnamed: 0,id,pickup_cell_id,dropoff_cell_id,count_per_cell_pair,count_per_hour_per_cell_pair,count_per_weekday_per_cell_pair
0,id1615460,0-0,2-2,1.0,,
1,id0847078,0-0,3-4,,,
2,id3967876,0-10,1-7,1.0,1.0,
3,id3073001,0-10,1-7,28.0,3.0,1.0
4,id3819150,0-10,1-7,33.0,2.0,2.0


In [52]:
average_duration = 900  # in seconds
weight = 100

with duckdb.connect('taxi-trips.db') as db:
    dataset = (
        db.execute(f'''
        SELECT
            trips.id,
            trips.pickup_datetime,
            targets.trip_duration,
            distances.* EXCLUDE (id),
            time_features.* EXCLUDE (id),
            recent_time_aggs.* EXCLUDE (id),
            cell_pair_counts.* EXCLUDE (id),
            -- Bayesian average: (A * WA) + (B * WB) / (WA + WB)
            (
                (
                    geo_counts.count_per_cell_pair * geo_aggs.avg_duration_per_cell_pair +
                    {weight} * {average_duration}
                ) /
                (geo_counts.count_per_cell_pair + {weight})
            ) AS avg_duration_per_cell_pair,
            (
                (
                    geo_counts.count_per_hour_per_cell_pair * geo_aggs.avg_duration_per_hour_per_cell_pair +
                    {weight} * {average_duration}
                ) /
                (geo_counts.count_per_cell_pair + {weight})
            ) AS avg_duration_per_hour_per_cell_pair,
            (
                (
                    geo_counts.count_per_weekday_per_cell_pair * geo_aggs.avg_duration_per_weekday_per_cell_pair +
                    {weight} * {average_duration}
                ) /
                (geo_counts.count_per_cell_pair + {weight})
            ) AS avg_duration_per_weekday_per_cell_pair
        FROM trips
        LEFT JOIN targets USING (id)
        LEFT JOIN distances USING (id)
        LEFT JOIN time_features USING (id)
        LEFT JOIN rolling_time_aggs USING (id)
        LEFT JOIN recent_time_aggs USING (id)
        LEFT JOIN cell_pair_counts USING (id)
        LEFT JOIN geo_aggs USING (id)
        LEFT JOIN geo_counts USING (id)
        ''')
        .fetch_df()
    )

train_and_test(dataset)


[MAE] 0:03:36.170476
hour: 207
dropoff_cell_id: 197
l2_distance: 152
pickup_cell_id: 91
weekday: 80
avg_duration_recent: 41
avg_duration_per_cell_pair: 39
avg_duration_per_weekday_recent: 34
avg_duration_per_hour_recent: 18
cell_pair_count: 17
l1_distance: 15
avg_duration_per_hour_per_cell_pair: 6
avg_duration_per_weekday_per_cell_pair: 3


In [53]:
(
    dataset.query('pickup_cell_id == "3-11" and dropoff_cell_id == "2-10"')
    [['pickup_datetime', 'trip_duration', 'avg_duration_per_cell_pair']]
    .plot(x='pickup_datetime', y=['avg_duration_per_cell_pair'])
)


This didn't get a mega boost, but it's useful to know about. Usually, the more you progress, the more you'll want to want to aggregate the target along exotic dimensions. Bayesian averaging is one way to prevent overfitting.

## Save for the next episode

In [54]:
dataset.to_pickle('../../data/taxi_trip_dataset.pkl')
