Example: Incremental loading with dlt
The goal: load only trips with a drop-off time on or after June 15, 2009, skipping the older ones.

Using dlt, we set up an incremental filter that keeps only trips with a drop-off time on or after a given date:

cursor_date = dlt.sources.incremental("Trip_Dropoff_DateTime", initial_value="2009-06-15")

This tells dlt:

Start date: June 15, 2009 (initial_value).
Field to track: Trip_Dropoff_DateTime (our timestamp).

Each time you run the pipeline, dlt records the latest Trip_Dropoff_DateTime value it has processed. On later runs it skips records older than that stored value.
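The cursor behaviour described above can be sketched in plain Python. This is a simplified model for illustration, not dlt's actual implementation: the function `incremental_filter`, the `state` dictionary, and the sample rows are all hypothetical. It keeps rows at or past the stored cursor value, remembers the rows seen at the newest value so they are not emitted twice, and advances the cursor once the run finishes.

```python
from typing import Any, Dict, Iterable, Iterator


def incremental_filter(
    rows: Iterable[Dict[str, Any]],
    cursor_field: str,
    state: Dict[str, Any],
    initial_value: str,
) -> Iterator[Dict[str, Any]]:
    """Yield only rows not seen in earlier runs and update `state` (hypothetical helper)."""
    # On the first run there is no stored cursor, so fall back to initial_value.
    start = state.get("last_value", initial_value)
    seen_at_start = set(state.get("seen_at_last_value", set()))

    new_last = start
    seen_at_new_last = set(seen_at_start)

    for row in rows:
        value = row[cursor_field]
        if value < start:
            continue  # older than the cursor: skip
        key = tuple(sorted(row.items()))
        if value == start and key in seen_at_start:
            continue  # boundary row already emitted in a previous run
        if value > new_last:
            new_last, seen_at_new_last = value, {key}
        elif value == new_last:
            seen_at_new_last.add(key)
        yield row

    # Persist the cursor only after the whole batch has been consumed.
    state["last_value"] = new_last
    state["seen_at_last_value"] = seen_at_new_last


# Made-up sample rows; ISO-formatted timestamps compare correctly as strings.
sample_rows = [
    {"Trip_Dropoff_DateTime": "2009-06-14 23:59:00", "fare": 7.5},
    {"Trip_Dropoff_DateTime": "2009-06-15 00:06:00", "fare": 5.0},
    {"Trip_Dropoff_DateTime": "2009-06-16 10:00:00", "fare": 9.0},
]

state: Dict[str, Any] = {}
first_run = list(incremental_filter(sample_rows, "Trip_Dropoff_DateTime", state, "2009-06-15"))
second_run = list(incremental_filter(sample_rows, "Trip_Dropoff_DateTime", state, "2009-06-15"))

print(len(first_run), len(second_run), state["last_value"])
```

In this sketch the first run keeps the two rows from June 15 onward, and the second run over the same input keeps nothing because the stored cursor has already advanced to the newest timestamp.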

Let's make the data resource incremental using dlt.sources.incremental:

In [1]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator


@dlt.resource(name="rides", write_disposition="append")
def ny_taxi(
    cursor_date=dlt.sources.incremental(
        "Trip_Dropoff_DateTime",   # <--- field to track, our timestamp
        initial_value="2009-06-15",   # <--- start date June 15, 2009
        )
    ):
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        paginator=PageNumberPaginator(
            base_page=1,
            total_path=None
        )
    )

    for page in client.paginate("data_engineering_zoomcamp_api"):
        yield page
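
The paginator settings in the resource above (base_page=1, total_path=None) mean there is no total page count to read from the response, so paging has to continue until the API returns an empty page. A minimal sketch of that loop follows, assuming a hypothetical `fetch_page` callable in place of the real HTTP request; `paginate_by_page_number` and the sample pages are illustrative names and data, not part of dlt.

```python
from typing import Any, Callable, Dict, Iterator, List

Page = List[Dict[str, Any]]


def paginate_by_page_number(
    fetch_page: Callable[[int], Page],
    base_page: int = 1,
) -> Iterator[Page]:
    """Request pages base_page, base_page + 1, ... until an empty page comes back."""
    page_number = base_page
    while True:
        page = fetch_page(page_number)
        if not page:
            break  # an empty page signals the end of the data
        yield page
        page_number += 1


# In-memory stand-in for the API (hypothetical data): two pages, then nothing.
fake_pages: List[Page] = [
    [{"id": 1}, {"id": 2}],
    [{"id": 3}],
]


def fake_fetch(page_number: int) -> Page:
    index = page_number - 1  # pages are numbered from 1
    return fake_pages[index] if index < len(fake_pages) else []


collected = list(paginate_by_page_number(fake_fetch))
print(collected)
```

Note that a loop like this always requests every page. Any incremental filtering is then applied to the records that were yielded.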

Finally, we run our pipeline and load the fresh taxi rides data:

In [2]:
# define new dlt pipeline
pipeline = dlt.pipeline(pipeline_name="ny_taxi", destination="duckdb", dataset_name="ny_taxi_data")

# run the pipeline with the new resource
load_info = pipeline.run(ny_taxi)
print(pipeline.last_trace)

Run started at 2025-02-18 19:05:20.296412+00:00 and COMPLETED in 33.24 seconds with 4 steps.
Step extract COMPLETED in 30.77 seconds.

Load package 1739905521.5706 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.06 seconds.
No data found to normalize

Step load COMPLETED in 1.17 seconds.
Pipeline ny_taxi load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:////workspaces/DEZoomCamp2025/workshop/ny_taxi.duckdb location to store data

Step run COMPLETED in 33.24 seconds.
Pipeline ny_taxi load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:////workspaces/DEZoomCamp2025/workshop/ny_taxi.duckdb location to store data


Only 5325 rows passed the filter and were loaded into the DuckDB destination. Let's take a look at the earliest drop-off time in the loaded data:

In [3]:
with pipeline.sql_client() as client:
    res = client.execute_sql(
            """
            SELECT
            MIN(trip_dropoff_date_time)
            FROM rides;
            """
        )
    print(res)

[(datetime.datetime(2009, 6, 15, 0, 6, tzinfo=<UTC>),)]


Run the same pipeline again.

In [4]:
# re-create the pipeline with the same name so dlt reuses its saved incremental state
pipeline = dlt.pipeline(pipeline_name="ny_taxi", destination="duckdb", dataset_name="ny_taxi_data")

# run the pipeline again with the same resource
load_info = pipeline.run(ny_taxi)
print(pipeline.last_trace)

Run started at 2025-02-18 19:06:14.081040+00:00 and COMPLETED in 29.58 seconds with 4 steps.
Step extract COMPLETED in 28.73 seconds.

Load package 1739905574.9094021 is EXTRACTED and NOT YET LOADED to the destination and contains no failed jobs

Step normalize COMPLETED in 0.05 seconds.
No data found to normalize

Step load COMPLETED in 0.02 seconds.
Pipeline ny_taxi load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:////workspaces/DEZoomCamp2025/workshop/ny_taxi.duckdb location to store data

Step run COMPLETED in 29.58 seconds.
Pipeline ny_taxi load step completed in ---
0 load package(s) were loaded to destination duckdb and into dataset None
The duckdb destination used duckdb:////workspaces/DEZoomCamp2025/workshop/ny_taxi.duckdb location to store data


Using the incremental cursor on Trip_Dropoff_DateTime, the pipeline detects that there are no new records, so nothing new is loaded into the destination:

0 load package(s) were loaded