https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2026/01-docker-terraform/homework.md

## Docker & SQL

In [1]:
import pandas as pd
from sqlalchemy import create_engine

In this homework we'll prepare the environment and practice Docker and SQL.

### Question 1. Understanding Docker images

Run docker with the `python:3.13` image. Use an entrypoint `bash` to interact with the container.

```bash
docker run -it --rm --entrypoint=bash python:3.13 
```

In [3]:
!docker ps

CONTAINER ID   IMAGE         COMMAND   CREATED          STATUS          PORTS     NAMES
0f51634f24e1   python:3.13   "bash"    56 seconds ago   Up 55 seconds             suspicious_tharp


In [4]:
!docker exec -it 0f51634f24e1 pip --version

pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)


What's the version of `pip` in the image?

Ans: 25.3

### Question 2. Understanding Docker networking and docker-compose

Given the following `docker-compose.yaml`, what is the `hostname` and `port` that pgadmin should use to connect to the postgres database?

```yaml
services:
  db:
    container_name: postgres
    image: postgres:17-alpine
    environment:
      POSTGRES_USER: 'postgres'
      POSTGRES_PASSWORD: 'postgres'
      POSTGRES_DB: 'ny_taxi'
    ports:
      - '5433:5432'
    volumes:
      - vol-pgdata:/var/lib/postgresql/data

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: "pgadmin@pgadmin.com"
      PGADMIN_DEFAULT_PASSWORD: "pgadmin"
    ports:
      - "8080:80"
    volumes:
      - vol-pgadmin_data:/var/lib/pgadmin

volumes:
  vol-pgdata:
    name: vol-pgdata
  vol-pgadmin_data:
    name: vol-pgadmin_data
```

Ans: db:5432
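
Why `db:5432`: containers in the same Compose project reach each other over the project network by service name and use the container-side port. The published port `5433` only matters for connections made from the host machine. The sketch below is purely illustrative: `split_port_mapping` and `connection_targets` are hypothetical helpers (not part of Docker or Compose), and they only handle the short `HOST:CONTAINER` port syntax.

```python
def split_port_mapping(mapping: str) -> tuple[int, int]:
    """Split a short-syntax 'HOST:CONTAINER' mapping into two ints."""
    host_port, container_port = mapping.split(":")
    return int(host_port), int(container_port)


def connection_targets(service: str, mapping: str) -> dict[str, str]:
    """Return the address to use from inside the network and from the host."""
    host_port, container_port = split_port_mapping(mapping)
    return {
        # From another container on the same Compose network (e.g. pgadmin).
        "from_container": f"{service}:{container_port}",
        # From the machine running Docker, via the published port.
        "from_host": f"localhost:{host_port}",
    }


targets = connection_targets("db", "5433:5432")
print(targets["from_container"])
print(targets["from_host"])
```

The same reasoning applies to pgadmin itself: a browser on the host uses the published side of `8080:80`.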

### Prepare the Data

Download the green taxi trips data for November 2025:

In [5]:
!wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet

--2026-01-26 18:42:12--  https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 18.239.38.163, 18.239.38.147, 18.239.38.181, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|18.239.38.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1164775 (1.1M) [binary/octet-stream]
Saving to: ‘green_tripdata_2025-11.parquet’


2026-01-26 18:42:12 (80.4 MB/s) - ‘green_tripdata_2025-11.parquet’ saved [1164775/1164775]



You will also need the dataset with zones:

In [7]:
!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv

--2026-01-26 18:42:34--  https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://release-assets.githubusercontent.com/github-production-release-asset/513814948/5a2cc2f5-b4cd-4584-9c62-a6ea97ed0e6a?sp=r&sv=2018-11-09&sr=b&spr=https&se=2026-01-26T19%3A32%3A04Z&rscd=attachment%3B+filename%3Dtaxi_zone_lookup.csv&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2026-01-26T18%3A31%3A42Z&ske=2026-01-26T19%3A32%3A04Z&sks=b&skv=2018-11-09&sig=h0N8ojSKpZ0IJ2pml5ov0IHKWNANkbCIbWfMOZ8ajNE%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc2OTQ1MzI1NCwibmJmIjoxNzY5NDUyOTU0LCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud2luZG93cy5uZXQifQ.hV1ZwtfznnPYnHxSnu17adENx1QejjCt7bmK1goRqes&response-content-disposition=attachment%3B%20filename%3Dtaxi_zone_lookup.csv&response-content-type=application%2Foctet-stream [following]
--2026-01-26 18:42:34--  https://release-assets.g

### Question 3. Counting short trips

For the trips in November 2025 (lpep_pickup_datetime between '2025-11-01' and '2025-12-01', exclusive of the upper bound), how many trips had a `trip_distance` of less than or equal to 1 mile?

In [2]:
df = pd.read_parquet("green_tripdata_2025-11.parquet", engine='pyarrow')
print(df.shape)
df.head()

(46912, 21)


|  | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | ... | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge | cbd_congestion_fee |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 2025-11-01 00:34:48 | 2025-11-01 00:41:39 | N | 1.0 | 74 | 42 | 1.0 | 0.74 | 7.2 | ... | 0.5 | 1.94 | 0.0 |  | 1.0 | 11.64 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 2025-11-01 00:18:52 | 2025-11-01 00:24:27 | N | 1.0 | 74 | 42 | 2.0 | 0.95 | 7.2 | ... | 0.5 | 0.0 | 0.0 |  | 1.0 | 9.7 | 2.0 | 1.0 | 0.0 | 0.0 |
| 2 | 2 | 2025-11-01 01:03:14 | 2025-11-01 01:15:24 | N | 1.0 | 83 | 160 | 1.0 | 2.19 | 13.5 | ... | 0.5 | 5.0 | 0.0 |  | 1.0 | 21.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | 2 | 2025-11-01 00:10:57 | 2025-11-01 00:24:53 | N | 1.0 | 166 | 127 | 1.0 | 5.44 | 24.7 | ... | 0.5 | 0.5 | 0.0 |  | 1.0 | 27.7 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1 | 2025-11-01 00:03:48 | 2025-11-01 00:19:38 | N | 1.0 | 166 | 262 | 1.0 | 3.2 | 18.4 | ... | 1.5 | 1.0 | 0.0 |  | 1.0 | 24.65 | 1.0 | 1.0 | 2.75 | 0.0 |


In [5]:
df.query("lpep_pickup_datetime >= '2025-11-01' \
          and lpep_pickup_datetime < '2025-12-01' \
          and trip_distance <= 1").shape

(8007, 21)

Ans: 8007
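
The question asks for a half-open interval: the lower bound is included and the upper bound is excluded, so a pickup at exactly midnight on 2025-11-01 counts, while one at midnight on 2025-12-01 does not. A minimal sketch of that boundary check on made-up timestamps (the `in_november` helper and the sample values are assumptions, not taken from the dataset):

```python
from datetime import datetime

START = datetime(2025, 11, 1)
END = datetime(2025, 12, 1)


def in_november(ts: datetime) -> bool:
    """Half-open interval test: START <= ts < END."""
    return START <= ts < END


samples = [
    datetime(2025, 11, 1, 0, 0, 0),      # first instant of the month: included
    datetime(2025, 11, 30, 23, 59, 59),  # last second of the month: included
    datetime(2025, 12, 1, 0, 0, 0),      # upper bound itself: excluded
]
flags = [in_november(ts) for ts in samples]
print(flags)
```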

### Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance? Only consider trips with `trip_distance` less than 100 miles (to exclude data errors).

In [11]:
df.query('trip_distance < 100') \
    .sort_values('trip_distance', ascending=False)[['lpep_pickup_datetime','trip_distance']].head(1)

|  | lpep_pickup_datetime | trip_distance |
| --- | --- | --- |
| 18867 | 2025-11-14 15:36:27 | 88.03 |


Ans: 2025-11-14
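
The heading mentions the longest trip for each day, while the question only needs the day of the single longest trip, so sorting by distance is enough. If the per-day maxima themselves are wanted, they could be computed along these lines. This is a plain-Python sketch on made-up rows, and `longest_trip_per_day` is a hypothetical helper:

```python
from datetime import datetime


def longest_trip_per_day(rows, max_distance=100.0):
    """Map each pickup date to its longest trip below max_distance miles."""
    best = {}
    for pickup, distance in rows:
        if distance >= max_distance:
            continue  # treated as a data error, as the question suggests
        day = pickup.date()
        if distance > best.get(day, float("-inf")):
            best[day] = distance
    return best


rows = [
    (datetime(2025, 11, 14, 15, 36), 88.03),
    (datetime(2025, 11, 14, 9, 0), 3.2),
    (datetime(2025, 11, 20, 12, 0), 250.0),  # dropped: not below 100 miles
    (datetime(2025, 11, 20, 18, 0), 12.5),
]
per_day = longest_trip_per_day(rows)
longest_day = max(per_day, key=per_day.get)
print(longest_day, per_day[longest_day])
```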

### Question 5. Biggest pickup zone

Which was the pickup zone with the largest `total_amount` (sum of all trips) on November 18th, 2025?

In [18]:
df_locations = pd.read_csv('taxi_zone_lookup.csv')
print(df_locations.shape)
df_locations.head()

(265, 4)


|  | LocationID | Borough | Zone | service_zone |
| --- | --- | --- | --- | --- |
| 0 | 1 | EWR | Newark Airport | EWR |
| 1 | 2 | Queens | Jamaica Bay | Boro Zone |
| 2 | 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 3 | 4 | Manhattan | Alphabet City | Yellow Zone |
| 4 | 5 | Staten Island | Arden Heights | Boro Zone |


In [22]:
df.query("lpep_pickup_datetime >= '2025-11-18' and lpep_pickup_datetime < '2025-11-19'") \
    .groupby('PULocationID', as_index=False)['total_amount'].sum() \
    .merge(df_locations, left_on='PULocationID', right_on='LocationID') \
    .sort_values('total_amount', ascending=False)[['LocationID','Zone','total_amount']].head(1)

|  | LocationID | Zone | total_amount |
| --- | --- | --- | --- |
| 39 | 74 | East Harlem North | 9281.92 |


Ans: East Harlem North
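
Since the homework is also about SQL, the same aggregation can be expressed as a query. The sketch below runs it against an in-memory SQLite database with a few made-up rows; the table names `green_trips` and `zones` and all amounts are assumptions, and a Postgres version would likely follow the same shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE green_trips (
        lpep_pickup_datetime TEXT,
        "PULocationID" INTEGER,
        total_amount REAL
    );
    CREATE TABLE zones ("LocationID" INTEGER, "Zone" TEXT);
    """
)
conn.executemany(
    "INSERT INTO green_trips VALUES (?, ?, ?)",
    [
        ("2025-11-18 08:00:00", 74, 20.0),
        ("2025-11-18 09:30:00", 74, 15.5),
        ("2025-11-18 10:00:00", 4, 30.0),
        ("2025-11-19 00:00:00", 4, 99.0),  # next day, must be ignored
    ],
)
conn.executemany(
    "INSERT INTO zones VALUES (?, ?)",
    [(74, "East Harlem North"), (4, "Alphabet City")],
)

# Sum total_amount per pickup zone for one day (half-open interval),
# then keep the zone with the largest sum.
query = """
    SELECT z."Zone", SUM(t.total_amount) AS amount_sum
    FROM green_trips AS t
    JOIN zones AS z ON t."PULocationID" = z."LocationID"
    WHERE t.lpep_pickup_datetime >= '2025-11-18'
      AND t.lpep_pickup_datetime < '2025-11-19'
    GROUP BY z."Zone"
    ORDER BY amount_sum DESC
    LIMIT 1
"""
top_zone, top_total = conn.execute(query).fetchone()
print(top_zone, top_total)
conn.close()
```

Storing timestamps as ISO-8601 text keeps the string comparison in the `WHERE` clause chronologically correct in SQLite; with a real timestamp column the bounds would be compared as timestamps instead.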

### Question 6. Largest tip

For the passengers picked up in the zone named "East Harlem North" in November 2025, which was the drop off zone that had the largest tip?

In [23]:
df.columns

Index(['VendorID', 'lpep_pickup_datetime', 'lpep_dropoff_datetime',
       'store_and_fwd_flag', 'RatecodeID', 'PULocationID', 'DOLocationID',
       'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax',
       'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge',
       'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge',
       'cbd_congestion_fee'],
      dtype='object')

In [27]:
df.query('PULocationID == 74').groupby('DOLocationID', as_index=False)['tip_amount'].max() \
    .merge(df_locations, left_on='DOLocationID', right_on='LocationID') \
    .sort_values('tip_amount', ascending=False)[['LocationID','Zone','tip_amount']].head(1)

|  | LocationID | Zone | tip_amount |
| --- | --- | --- | --- |
| 134 | 263 | Yorkville West | 81.89 |


Ans: Yorkville West

### Question 7. Terraform Workflow

Which of the following sequences describes, respectively, the workflow for:
1. Downloading the provider plugins and setting up the backend
2. Generating proposed changes and auto-executing the plan
3. Removing all resources managed by Terraform

Ans: terraform init, terraform apply -auto-approve, terraform destroy