# **Workshop "Data Ingestion with dlt": Homework**


---

## **Dataset & API**

We'll use **NYC Taxi data** via the same custom API from the workshop:

🔹 **Base API URL:**  
```
https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api
```
🔹 **Data format:** Paginated JSON (1,000 records per page).  
🔹 **API Pagination:** Stop when an empty page is returned.  
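
The stopping rule above can be sketched in plain Python, independent of dlt. The `paginate` helper, the `fetch_page` callable, and the in-memory `fake_api` are hypothetical stand-ins for real HTTP calls, and the sketch assumes pages are numbered from 1. Only the loop structure matters here: request pages in order and stop at the first empty one.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int], list], start_page: int = 1) -> Iterator[list]:
    """Yield pages in order until the first empty page is returned."""
    page_number = start_page
    while True:
        records = fetch_page(page_number)
        if not records:  # an empty page signals the end of the data
            break
        yield records
        page_number += 1


# Hypothetical stand-in for the real API: 2,500 records, 1,000 per page.
PAGE_SIZE = 1000
FAKE_RECORDS = [{"id": i} for i in range(2500)]


def fake_api(page_number: int) -> list:
    """Return one slice of the fake records (assumes 1-based page numbers)."""
    start = (page_number - 1) * PAGE_SIZE
    return FAKE_RECORDS[start:start + PAGE_SIZE]


page_sizes = [len(page) for page in paginate(fake_api)]
print(page_sizes)  # [1000, 1000, 500]
```

The last page may be shorter than the page size, so the loop checks for an empty page rather than a short one.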

## **Question 1: dlt Version**

Before installing dlt, check which Python environment is in use:

In [7]:
import sys
print(sys.executable)

/home/alessandro/GIT/.venv/bin/python


1. **Install dlt**:

In [3]:
!pip install "dlt[duckdb]"

Collecting dlt[duckdb]
  Using cached dlt-1.7.0-py3-none-any.whl.metadata (11 kB)
Collecting PyYAML>=5.4.1 (from dlt[duckdb])
  Using cached PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting click>=7.1 (from dlt[duckdb])
  Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Collecting duckdb>=0.9 (from dlt[duckdb])
  Using cached duckdb-1.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (966 bytes)
Collecting fsspec>=2022.4.0 (from dlt[duckdb])
  Using cached fsspec-2025.2.0-py3-none-any.whl.metadata (11 kB)
Collecting gitpython>=3.1.29 (from dlt[duckdb])
  Using cached GitPython-3.1.44-py3-none-any.whl.metadata (13 kB)
Collecting giturlparse>=0.10.0 (from dlt[duckdb])
  Using cached giturlparse-0.12.0-py2.py3-none-any.whl.metadata (4.5 kB)
Collecting hexbytes>=0.2.2 (from dlt[duckdb])
  Using cached hexbytes-1.3.0-py3-none-any.whl.metadata (3.3 kB)
Collecting humanize>=4.4.0 (from dlt[duckdb])
  Using ca

> Or choose a different extra (`bigquery`, `redshift`, etc.) if you prefer another primary destination. For this assignment, we'll still do a quick test with DuckDB.

2. **Check** the version:


In [5]:
!dlt --version

dlt 1.7.0


or:

In [6]:
import dlt
print("dlt version:", dlt.__version__)

dlt version: 1.7.0
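
A third option, offered as a convenience and not taken from the workshop material, reads the version from the installed package metadata without importing dlt. The helper name `installed_version` is made up for this sketch; `importlib.metadata` is part of the Python standard library (3.8+).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("dlt version:", installed_version("dlt"))
```

If this prints `None`, dlt is not installed in the interpreter that `sys.executable` points to.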


**Answer**:  
- Provide the **version** you see in the output.

## **Install DuckDB**

The `dlt[duckdb]` extra installed above already pulls in DuckDB. If you installed dlt without that extra, install DuckDB in the same Python environment before proceeding:

In [8]:
!pip install duckdb



## **Question 2: Define & Run the Pipeline (NYC Taxi API)**

Use dlt to extract all pages of data from the API.

Steps:

1️⃣ Use the `@dlt.resource` decorator to define the API source.

2️⃣ Implement automatic pagination using dlt's built-in REST client.

3️⃣ Load the extracted data into DuckDB for querying.



In [9]:
import dlt
from dlt.sources.helpers.rest_client import RESTClient
from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

In [None]:
# your code is here

In [None]:
pipeline = dlt.pipeline(
    pipeline_name="ny_taxi_pipeline",
    destination="duckdb",
    dataset_name="ny_taxi_data"
)

Load the data into DuckDB to test:






In [None]:
load_info = pipeline.run(ny_taxi)
print(load_info)

Connect to your database with a native `duckdb` connection and see which tables were generated:

In [None]:
import duckdb

# Optional: richer DataFrame rendering, available only in Google Colab
try:
    from google.colab import data_table
    data_table.enable_dataframe_formatter()
except ImportError:
    pass

# dlt created a database file named '<pipeline_name>.duckdb' in the working directory

# Connect to the DuckDB database
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")

# Set search path to the dataset
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")

# Describe the dataset
conn.sql("DESCRIBE").df()

**Answer:**
* How many tables were created?

## **Question 3: Explore the loaded data**

Inspect the table `rides`:


In [None]:
df = pipeline.dataset(dataset_type="default").rides.df()
df

**Answer:**
* What is the total number of records extracted?

## **Question 4: Trip Duration Analysis**

Run the SQL query below to:

* Calculate the average trip duration in minutes.

In [None]:
with pipeline.sql_client() as client:
    res = client.execute_sql(
        """
        SELECT
            AVG(date_diff('minute', trip_pickup_date_time, trip_dropoff_date_time))
        FROM rides;
        """
    )
    # res is a list of rows; the average is the only column of the first row
    print(res)
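
To make the query's arithmetic concrete, here is a small plain-Python sketch of the same calculation. It assumes, based on DuckDB's documentation, that `date_diff('minute', ...)` counts the minute boundaries crossed between two timestamps and does not count full elapsed minutes. The `minute_diff` helper and the three sample trips are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime


def minute_diff(start: datetime, end: datetime) -> int:
    """Count the minute boundaries crossed between start and end."""
    start_minute = start.replace(second=0, microsecond=0)
    end_minute = end.replace(second=0, microsecond=0)
    return int((end_minute - start_minute).total_seconds() // 60)


# Made-up sample trips as (pickup, dropoff) pairs.
trips = [
    (datetime(2009, 6, 14, 23, 23, 0), datetime(2009, 6, 14, 23, 48, 0)),
    (datetime(2009, 6, 18, 17, 35, 30), datetime(2009, 6, 18, 17, 43, 10)),
    # Only 2 seconds elapse here, but one minute boundary is crossed.
    (datetime(2009, 6, 10, 18, 8, 59), datetime(2009, 6, 10, 18, 9, 1)),
]

durations = [minute_diff(pickup, dropoff) for pickup, dropoff in trips]
average_minutes = sum(durations) / len(durations)
print(durations)        # [25, 8, 1]
print(average_minutes)  # 34 / 3, roughly 11.33
```

Under this assumption the result can differ slightly from an average based on exact elapsed seconds, which is worth keeping in mind when comparing your answer against the options.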

**Answer:**
* What is the average trip duration?

## **Submitting the solutions**

* Form for submitting: TBA




## **Solution**

We will publish the solution here after the deadline.