# Homework "Data Ingestion with dlt"

---

We’ll use NYC Taxi data via the same custom API from the workshop:

**Base API URL**:
https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api

**Data format**: Paginated JSON (1,000 records per page).

**API Pagination**: Stop when an empty page is returned.
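
The stop rule above can be sketched in plain Python without touching the network. This is only an illustration of the "stop at the first empty page" idea, not how dlt implements it: `iterate_pages` and `fake_fetch_page` are hypothetical names, and the in-memory fetcher with 2,500 records stands in for the real API.

```python
from typing import Callable, Dict, Iterator, List


def iterate_pages(
    fetch_page: Callable[[int], List[Dict]], first_page: int = 1
) -> Iterator[List[Dict]]:
    """Yield pages one by one until the fetcher returns an empty page."""
    page_number = first_page
    while True:
        records = fetch_page(page_number)
        if not records:  # an empty page signals the end of the data
            break
        yield records
        page_number += 1


def fake_fetch_page(
    page_number: int, page_size: int = 1000, total_records: int = 2500
) -> List[Dict]:
    """Hypothetical stand-in for the API: serves slices of an in-memory dataset."""
    start = (page_number - 1) * page_size
    end = min(start + page_size, total_records)
    return [{"record_id": i} for i in range(start, end)]


pages = list(iterate_pages(fake_fetch_page))
page_sizes = [len(page) for page in pages]
print(page_sizes)
```

Because the generator yields one page at a time, only a single page needs to be held in memory while it is being processed.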

---

# Question 1: dlt Version
1. **Install** dlt:

In [10]:
!pip install 'dlt[duckdb]'



2. **Check** version:

In [3]:
!dlt --version

dlt 1.6.1


In [4]:
import dlt
print("dlt version:", dlt.__version__)

dlt version: 1.6.1
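
A version check can also be written so that it does not raise when the package is missing from the active environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


dlt_version = installed_version("dlt")
print("dlt version:", dlt_version if dlt_version else "not installed")
```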


### Answer: **1.6.1**

---


# Question 2: Define & Run the Pipeline (NYC Taxi API)
1. Use the `@dlt.resource` decorator to define the API source.
2. Implement automatic pagination using dlt's built-in REST client.
3. Load the extracted data into DuckDB for querying.

In [11]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

In [12]:
# Define the API resource for NYC taxi data
@dlt.resource(name="rides")   # the resource name becomes the table name
def ny_taxi():
    client = RESTClient(
        base_url="https://us-central1-dlthub-analytics.cloudfunctions.net",
        paginator=PageNumberPaginator(
            base_page=1,      # page numbering starts at 1
            total_path=None   # no total page count is read; stop at the first empty page
        )
    )

    # Request the taxi ride endpoint page by page
    for page in client.paginate("data_engineering_zoomcamp_api"):
        yield page   # yield one page at a time to keep memory use low


pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_pipeline",
    destination="duckdb",
    dataset_name="ny_taxi_data"
)

Load the data into DuckDB to test:

In [13]:
load_info = pipeline.run(ny_taxi)
print(load_info)

Pipeline ny_taxi_pipeline load step completed in 0.76 seconds
1 load package(s) were loaded to destination duckdb and into dataset ny_taxi_data
The duckdb destination used duckdb:////Users/zharauai/de-zoomcamp-2025/workshop-dlt/ny_taxi_pipeline.duckdb location to store data
Load package 1739791364.340128 is LOADED and contains no failed jobs


Connect to the database with the native DuckDB client and check which tables were generated:

In [14]:
import duckdb

# The Colab table formatter is optional; skip it when running outside Google Colab
try:
    from google.colab import data_table
    data_table.enable_dataframe_formatter()
except ImportError:
    pass

# The pipeline created '<pipeline_name>.duckdb' in the working directory, so connect to it
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# Set search path to the dataset
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")

# List the tables in the dataset
conn.sql("DESCRIBE").df()

- How many tables were created?

  ### Answer: 4 (`rides` plus the dlt system tables `_dlt_loads`, `_dlt_pipeline_state`, and `_dlt_version`)

---

# Question 3: Explore the loaded data
Inspect the table `rides`:

In [15]:
df = pipeline.dataset(dataset_type="default").rides.df()
df

ModuleNotFoundError: No module named 'numpy'

- What is the total number of records extracted?

  **Answer**: 10,000

  ---

# Question 4: Trip Duration Analysis
Run the SQL query below to:

- Calculate the average trip duration in minutes.

In [None]:
with pipeline.sql_client() as client:
    res = client.execute_sql(
            """
            SELECT
            AVG(date_diff('minute', trip_pickup_date_time, trip_dropoff_date_time))
            FROM rides;
            """
        )
    # execute_sql returns a list of row tuples; the single row holds the average
    print(res)
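
To make the aggregation concrete, here is a minimal plain-Python sketch of the same calculation. It assumes that DuckDB's `date_diff('minute', ...)` counts minute boundaries crossed between the two timestamps rather than elapsed seconds divided by 60; the helper names and the two sample trips are hypothetical and are not taken from the loaded data.

```python
from datetime import datetime
from typing import List, Tuple


def minute_boundaries(start: datetime, end: datetime) -> int:
    """Count the minute boundaries crossed between start and end."""
    # Truncate both timestamps to the minute, then take the difference.
    start_minute = start.replace(second=0, microsecond=0)
    end_minute = end.replace(second=0, microsecond=0)
    return int((end_minute - start_minute).total_seconds() // 60)


def average_trip_minutes(trips: List[Tuple[datetime, datetime]]) -> float:
    """Average the per-trip minute differences, like AVG(date_diff('minute', ...))."""
    durations = [minute_boundaries(pickup, dropoff) for pickup, dropoff in trips]
    return sum(durations) / len(durations)


# Hypothetical sample trips as (pickup, dropoff) pairs
sample_trips = [
    (datetime(2009, 6, 14, 23, 23, 0), datetime(2009, 6, 14, 23, 48, 0)),    # 25 minutes
    (datetime(2009, 6, 18, 17, 35, 30), datetime(2009, 6, 18, 17, 43, 10)),  # 8 minutes
]

average = average_trip_minutes(sample_trips)
print(average)  # 16.5
```

Under that assumption the per-trip values are whole minutes, so the seconds part of each timestamp only matters when it pushes a trip across a minute boundary.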

- What is the average trip duration?

  **Answer**: 12.3049 minutes