### Link to homework
https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/cohorts/2024/01-docker-terraform/homework.md


In [1]:
import pandas as pd
from sqlalchemy import create_engine, text
from urllib.parse import quote_plus
from time import time

In [2]:
trip_data_file_path = "/workspaces/data-engineering-zoomcamp/data/green_tripdata_2019-10.csv.gz"
green_table_name = "green_tripdata_2019_10"

In [3]:
encoded_password = quote_plus("P@ssw0rd!")
print(encoded_password)
engine = create_engine(f"postgresql://postgres:{encoded_password}@db:5432/ny_taxi")

P%40ssw0rd%21
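
The printed value is the percent-encoded password: `@` becomes `%40` and `!` becomes `%21`, so the password cannot be confused with the `@` that separates credentials from the host. Here is a small standard-library sketch of the round trip. The helper name `build_postgres_url` is hypothetical and only mirrors the f-string used in the cell above; it is not part of SQLAlchemy.

```python
from urllib.parse import quote_plus, unquote_plus


def build_postgres_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Hypothetical helper: percent-encode the password before embedding it in the URL."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{database}"


encoded = quote_plus("P@ssw0rd!")
url = build_postgres_url("postgres", "P@ssw0rd!", "db", 5432, "ny_taxi")

print(encoded)                # encoded form, safe to place before the "@host" part
print(unquote_plus(encoded))  # decoding gives the original password back
print(url)
```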


In [4]:
engine.connect()

<sqlalchemy.engine.base.Connection at 0x7f6650a65f40>

In [5]:
query = """
    SELECT *
    FROM green_tripdata_2019_10
    LIMIT 1
"""

pd.read_sql(query, engine)

|   | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2019-10-01 00:26:02 | 2019-10-01 00:39:58 | N | 1.0 | 112 | 196 | 1.0 | 5.88 | 18.0 | 0.5 | 0.5 | 0.0 | 0.0 |  | 0.3 | 19.3 | 2.0 | 1.0 | 0.0 |


## Question 1. Understanding docker first run 

Run docker with the `python:3.12.8` image in an interactive mode, use the entrypoint `bash`.

What's the version of `pip` in the image?

- 24.3.1
- 24.2.1
- 23.3.1
- 23.2.1

Answer: 24.3.1. Running `pip --version` inside the container (`root@3b27cc74845d:/#`) reports `pip 24.3.1`.

## Question 2. Understanding Docker networking and docker-compose

Given the following `docker-compose.yaml`, what is the `hostname` and `port` that **pgadmin** should use to connect to the postgres database?

```yaml
services:
  db:
    container_name: postgres
    image: postgres:17-alpine
    environment:
      POSTGRES_USER: 'postgres'
      POSTGRES_PASSWORD: 'postgres'
      POSTGRES_DB: 'ny_taxi'
    ports:
      - '5433:5432'
    volumes:
      - vol-pgdata:/var/lib/postgresql/data

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: "pgadmin@pgadmin.com"
      PGADMIN_DEFAULT_PASSWORD: "pgadmin"
    ports:
      - "8080:80"
    volumes:
      - vol-pgadmin_data:/var/lib/pgadmin  

volumes:
  vol-pgdata:
    name: vol-pgdata
  vol-pgadmin_data:
    name: vol-pgadmin_data
```

- postgres:5433
- localhost:5432
- db:5433
- postgres:5432
- db:5432

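Answer: db:5432. Containers on the same Compose network reach each other by service name (`db`) and the container port (`5432`); the published port `5433` only applies to connections from the host. The container name resolves as well, so `postgres:5432` should also work.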
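
To make the port reasoning concrete, here is a small sketch that splits the `'5433:5432'` mapping from the compose file into its host and container sides. The helper `split_port_mapping` is a hypothetical name for illustration and not part of Docker or Compose; it only handles the short `HOST:CONTAINER` form.

```python
def split_port_mapping(mapping: str) -> tuple[int, int]:
    """Split a short-form Compose port mapping 'HOST:CONTAINER' into two ints."""
    host_port, container_port = mapping.split(":")
    return int(host_port), int(container_port)


host_port, container_port = split_port_mapping("5433:5432")

# Inside the Compose network: service name + container port.
from_pgadmin = f"db:{container_port}"
# From the host machine: localhost + published (host) port.
from_host = f"localhost:{host_port}"

print("pgadmin connects to:", from_pgadmin)
print("host tools connect to:", from_host)
```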

## Prepare Postgres

Run Postgres and load the data as shown in the videos.
We'll use the green taxi trips from October 2019:

```bash
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-10.csv.gz
```

You will also need the dataset with zones:

```bash
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv
```

Download this data and put it into Postgres.

You can use the code from the course. It's up to you whether
you want to use Jupyter or a python script.
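
As a rough illustration of chunked loading, here is a dependency-free sketch. It is a stand-in, not the course code: it writes a tiny gzipped CSV with made-up rows, then reads it back in fixed-size chunks and inserts each chunk into an in-memory SQLite table. This is roughly the role that pandas `read_csv(..., chunksize=...)` plus `DataFrame.to_sql(..., if_exists="append")` would play against Postgres. The table name `trips`, the sample file, and the helper `load_in_chunks` are all assumptions for the example.

```python
import csv
import gzip
import itertools
import os
import sqlite3
import tempfile

# Toy stand-in for green_tripdata_2019-10.csv.gz (made-up rows, not real data).
sample_rows = [
    {"lpep_pickup_datetime": "2019-10-01 00:26:02", "trip_distance": "5.88"},
    {"lpep_pickup_datetime": "2019-10-01 00:18:11", "trip_distance": "0.90"},
    {"lpep_pickup_datetime": "2019-10-02 08:05:40", "trip_distance": "2.10"},
]

csv_path = os.path.join(tempfile.mkdtemp(), "sample_tripdata.csv.gz")
with gzip.open(csv_path, "wt", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["lpep_pickup_datetime", "trip_distance"])
    writer.writeheader()
    writer.writerows(sample_rows)


def load_in_chunks(path: str, conn: sqlite3.Connection, chunk_size: int = 2) -> int:
    """Read a gzipped CSV in fixed-size chunks and insert each chunk as one batch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips (lpep_pickup_datetime TEXT, trip_distance REAL)"
    )
    total = 0
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.DictReader(f)
        while True:
            # Pull at most chunk_size rows so the whole file is never in memory.
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            conn.executemany(
                "INSERT INTO trips VALUES (:lpep_pickup_datetime, :trip_distance)",
                chunk,
            )
            conn.commit()
            total += len(chunk)
    return total


conn = sqlite3.connect(":memory:")
rows_loaded = load_in_chunks(csv_path, conn)
row_count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(rows_loaded, row_count)
```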

## Question 3. Trip Segmentation Count

During the period of October 1st 2019 (inclusive) and November 1st 2019 (exclusive), how many trips, **respectively**, happened:
1. Up to 1 mile
2. In between 1 (exclusive) and 3 miles (inclusive),
3. In between 3 (exclusive) and 7 miles (inclusive),
4. In between 7 (exclusive) and 10 miles (inclusive),
5. Over 10 miles 

Answers:

- 104,802;  197,670;  110,612;  27,831;  35,281
- 104,802;  198,924;  109,603;  27,678;  35,189
- 104,793;  201,407;  110,612;  27,831;  35,281
- 104,793;  202,661;  109,603;  27,678;  35,189
- 104,838;  199,013;  109,645;  27,688;  35,202


Answer: 104,802; 198,924; 109,603; 27,678; 35,189


In [6]:
query = f"""
SELECT
  COUNT(GTD_UP_TO_1.trip_distance) AS UP_TO_1
FROM green_tripdata_2019_10 AS GTD_UP_TO_1
WHERE
  GTD_UP_TO_1."lpep_pickup_datetime" >= '2019-10-01 00:00:00'
  AND GTD_UP_TO_1."lpep_dropoff_datetime" < '2019-11-01 00:00:00'
  AND GTD_UP_TO_1.trip_distance <= 1
"""

query_between_1_3 = f"""
SELECT COUNT(GTD.trip_distance) AS GTD
FROM green_tripdata_2019_10 AS GTD
WHERE
  GTD."lpep_pickup_datetime" >= '2019-10-01 00:00:00'
  AND GTD."lpep_dropoff_datetime" < '2019-11-01 00:00:00'
  AND GTD.trip_distance > 1 AND GTD.trip_distance <= 3
"""

query_between_3_7 = f"""
SELECT COUNT(GTD.trip_distance) AS GTD
FROM green_tripdata_2019_10 AS GTD
WHERE
  GTD."lpep_pickup_datetime" >= '2019-10-01 00:00:00'
  AND GTD."lpep_dropoff_datetime" < '2019-11-01 00:00:00'
  AND GTD.trip_distance > 3 AND GTD.trip_distance <= 7
"""

query_between_7_10 = f"""
SELECT COUNT(GTD.trip_distance) AS GTD
FROM green_tripdata_2019_10 AS GTD
WHERE
  GTD."lpep_pickup_datetime" >= '2019-10-01 00:00:00'
  AND GTD."lpep_dropoff_datetime" < '2019-11-01 00:00:00'
  AND GTD.trip_distance > 7 AND GTD.trip_distance <= 10
"""

query_over_10 = f"""
SELECT COUNT(GTD.trip_distance) AS GTD
FROM green_tripdata_2019_10 AS GTD
WHERE
  GTD."lpep_pickup_datetime" >= '2019-10-01 00:00:00'
  AND GTD."lpep_dropoff_datetime" < '2019-11-01 00:00:00'
  AND GTD.trip_distance > 10
"""

print(f"Up to 1: {pd.read_sql_query(query, engine)}")
print(f"Between 1 and 3: {pd.read_sql_query(query_between_1_3, engine)}")
print(f"Between 3 and 7: {pd.read_sql_query(query_between_3_7, engine)}")
print(f"Between 7 and 10: {pd.read_sql_query(query_between_7_10, engine)}")
print(f"Over 10: {pd.read_sql_query(query_over_10, engine)}")

Up to 1:    up_to_1
0   104802
Between 1 and 3:       gtd
0  198924
Between 3 and 7:       gtd
0  109603
Between 7 and 10:      gtd
0  27678
Over 10:      gtd
0  35189
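
The five queries above can also be folded into a single pass with conditional aggregation. Here is a minimal sketch, assuming an in-memory SQLite table named `trips` with made-up rows so that it runs without the Postgres container. Against Postgres the same `SUM(CASE ...)` pattern should work on `green_tripdata_2019_10`, with the same pickup/dropoff window used in the cell above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips ("
    "lpep_pickup_datetime TEXT, lpep_dropoff_datetime TEXT, trip_distance REAL)"
)
# Made-up rows: one per bucket, plus one trip that drops off outside the window.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2019-10-01 00:05:00", "2019-10-01 00:10:00", 0.8),
        ("2019-10-02 09:00:00", "2019-10-02 09:20:00", 3.0),
        ("2019-10-03 10:00:00", "2019-10-03 10:30:00", 5.5),
        ("2019-10-04 11:00:00", "2019-10-04 11:40:00", 10.0),
        ("2019-10-05 12:00:00", "2019-10-05 13:00:00", 12.4),
        ("2019-10-31 23:50:00", "2019-11-01 00:10:00", 2.0),
    ],
)

query = """
SELECT
  SUM(CASE WHEN trip_distance <= 1 THEN 1 ELSE 0 END) AS up_to_1,
  SUM(CASE WHEN trip_distance > 1 AND trip_distance <= 3 THEN 1 ELSE 0 END) AS between_1_3,
  SUM(CASE WHEN trip_distance > 3 AND trip_distance <= 7 THEN 1 ELSE 0 END) AS between_3_7,
  SUM(CASE WHEN trip_distance > 7 AND trip_distance <= 10 THEN 1 ELSE 0 END) AS between_7_10,
  SUM(CASE WHEN trip_distance > 10 THEN 1 ELSE 0 END) AS over_10
FROM trips
WHERE lpep_pickup_datetime >= '2019-10-01 00:00:00'
  AND lpep_dropoff_datetime < '2019-11-01 00:00:00'
"""

counts = conn.execute(query).fetchone()
print(counts)
```

One scan of the table produces all five buckets, and the date filter is written once instead of five times.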


## Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance?
Use the pick up time for your calculations.

Tip: For every day, we only care about one single trip with the longest distance. 

- 2019-10-11
- 2019-10-24
- 2019-10-26
- 2019-10-31

In [7]:
query = f"""
SELECT COUNT(*) FROM {green_table_name}
"""

pd.read_sql_query(query, engine)

|   | count |
|---|---|
| 0 | 476386 |


In [8]:
query_longest = f"""
SELECT lpep_pickup_datetime, * FROM green_tripdata_2019_10 AS GTD
ORDER BY GTD.trip_distance DESC
LIMIT 1
"""

pd.read_sql_query(query_longest, engine)

|   | lpep_pickup_datetime | VendorID | lpep_pickup_datetime.1 | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | ... | extra | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-10-31 23:23:41 | 2 | 2019-10-31 23:23:41 | 2019-11-01 13:01:07 | N | 5.0 | 129 | 265 | 1.0 | 515.89 | ... | 2.75 | 0.0 | 0.0 | 0.0 |  | 0.3 | 103.05 | 2.0 | 1.0 | 0.0 |


Answer: the longest trip was on 2019-10-31
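
The cell above takes the single longest trip overall, which is enough to name the day. To follow the tip literally (one longest trip per pickup day), the query can group by day first. Here is a minimal sketch on an in-memory SQLite table with made-up rows; `trips` is an assumed table name, and `DATE(...)` stands in for the `CAST(... AS Date)` used elsewhere in this notebook for Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (lpep_pickup_datetime TEXT, trip_distance REAL)")
# Made-up distances, chosen only to exercise the grouping.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("2019-10-11 08:00:00", 12.5),
        ("2019-10-11 18:30:00", 3.2),
        ("2019-10-24 07:15:00", 40.0),
        ("2019-10-31 23:23:41", 99.9),
        ("2019-10-31 12:00:00", 1.1),
    ],
)

query = """
SELECT DATE(lpep_pickup_datetime) AS pickup_day,
       MAX(trip_distance) AS longest_trip
FROM trips
GROUP BY DATE(lpep_pickup_datetime)
ORDER BY longest_trip DESC
"""

per_day = conn.execute(query).fetchall()  # one row per pickup day
top_day = per_day[0]                      # day holding the longest single trip
print(top_day)
```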

## Question 5. Three biggest pickup zones

Which were the top pickup locations with over 13,000 in
`total_amount` (across all trips) for 2019-10-18?

Consider only `lpep_pickup_datetime` when filtering by date.
 
- East Harlem North, East Harlem South, Morningside Heights
- East Harlem North, Morningside Heights
- Morningside Heights, Astoria Park, East Harlem South
- Bedford, East Harlem North, Astoria Park

In [9]:
query_taxi_zones = f"""
    SELECT *
    FROM taxi_zones
    LIMIT 10
"""

pd.read_sql(query_taxi_zones, engine)

|   | index | LocationID | Borough | Zone |
|---|---|---|---|---|
| 0 | 0 | 1 | EWR | Newark Airport |
| 1 | 1 | 2 | Queens | Jamaica Bay |
| 2 | 2 | 3 | Bronx | Allerton/Pelham Gardens |
| 3 | 3 | 4 | Manhattan | Alphabet City |
| 4 | 4 | 5 | Staten Island | Arden Heights |
| 5 | 5 | 6 | Staten Island | Arrochar/Fort Wadsworth |
| 6 | 6 | 7 | Queens | Astoria |
| 7 | 7 | 8 | Queens | Astoria Park |
| 8 | 8 | 9 | Queens | Auburndale |
| 9 | 9 | 10 | Queens | Baisley Park |


In [15]:
query = f"""
SELECT taxi_zones."Zone", SUM(total_amount) AS total_amount
FROM green_tripdata_2019_10 AS GTD
INNER JOIN taxi_zones ON GTD."PULocationID" = taxi_zones."LocationID"
WHERE GTD."lpep_pickup_datetime" >= '2019-10-18 00:00:00'
  AND GTD."lpep_pickup_datetime" <= '2019-10-18 23:59:59'
  AND (taxi_zones."Zone" = 'East Harlem North'
    OR taxi_zones."Zone" = 'East Harlem South'
    OR taxi_zones."Zone" = 'Morningside Heights'
    OR taxi_zones."Zone" = 'Astoria Park'
    OR taxi_zones."Zone" = 'Bedford')
GROUP BY taxi_zones."Zone"
ORDER BY SUM(total_amount) DESC
"""
# --

pd.read_sql(query, engine)

|   | Zone | total_amount |
|---|---|---|
| 0 | East Harlem North | 18686.68 |
| 1 | East Harlem South | 16797.26 |
| 2 | Morningside Heights | 13029.79 |
| 3 | Bedford | 2441.65 |


Astoria Park is missing from the result above. Check its totals per pickup day:

In [18]:
query = f"""
SELECT taxi_zones."Zone", CAST(GTD."lpep_pickup_datetime" AS Date), SUM(total_amount) AS total_amount
FROM green_tripdata_2019_10 AS GTD
INNER JOIN taxi_zones ON GTD."PULocationID" = taxi_zones."LocationID"
WHERE taxi_zones."Zone" = 'Astoria Park'
GROUP BY taxi_zones."Zone", CAST(GTD."lpep_pickup_datetime" AS Date)
ORDER BY SUM(total_amount) DESC
"""

pd.read_sql(query, engine)

|   | Zone | lpep_pickup_datetime | total_amount |
|---|---|---|---|
| 0 | Astoria Park | 2019-10-26 | 66.12 |
| 1 | Astoria Park | 2019-10-20 | 30.96 |
| 2 | Astoria Park | 2019-10-19 | 29.3 |
| 3 | Astoria Park | 2019-10-31 | 7.3 |
| 4 | Astoria Park | 2019-10-25 | 6.8 |


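Answer to question 5: East Harlem North, East Harlem South, Morningside Heights. These are the only zones above 13,000 on 2019-10-18; Bedford totals 2,441.65, and Astoria Park has no pickups that day.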
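
Instead of listing candidate zones by hand, the threshold can go into a `HAVING` clause so that only qualifying zones come back. Here is a minimal sketch on in-memory SQLite tables with made-up amounts and a toy threshold of 100 in place of 13,000. The table names `trips` and `zones` are assumptions; the zone IDs are the ones shown in the lookup outputs in this notebook. In Postgres the mixed-case column names need double quotes, as in the cells above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zones (LocationID INTEGER, Zone TEXT)")
conn.execute(
    "CREATE TABLE trips (lpep_pickup_datetime TEXT, PULocationID INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO zones VALUES (?, ?)",
    [(74, "East Harlem North"), (7, "Astoria"), (8, "Astoria Park")],
)
# Made-up amounts; the last row falls on the next day and must be excluded.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2019-10-18 08:00:00", 74, 80.0),
        ("2019-10-18 21:00:00", 74, 70.0),
        ("2019-10-18 10:00:00", 7, 120.0),
        ("2019-10-18 11:00:00", 8, 30.0),
        ("2019-10-19 00:00:00", 8, 500.0),
    ],
)

query = """
SELECT z.Zone, SUM(t.total_amount) AS zone_total
FROM trips AS t
JOIN zones AS z ON t.PULocationID = z.LocationID
WHERE t.lpep_pickup_datetime >= '2019-10-18 00:00:00'
  AND t.lpep_pickup_datetime < '2019-10-19 00:00:00'
GROUP BY z.Zone
HAVING SUM(t.total_amount) > ?
ORDER BY SUM(t.total_amount) DESC
"""

threshold = 100  # toy value; the homework uses 13000
top_zones = conn.execute(query, (threshold,)).fetchall()
print(top_zones)
```

The half-open range (`< '2019-10-19 00:00:00'`) avoids having to pick a last timestamp such as `23:59:59`.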

## Question 6. Largest tip

For the passengers picked up in October 2019 in the zone
named "East Harlem North", which was the drop-off zone that had
the largest tip?

Note: it's `tip`, not `trip`.

We need the name of the zone, not the ID.

- Yorkville West
- JFK Airport
- East Harlem North
- East Harlem South

Let's look up this zone:

In [20]:
query = f"""
SELECT *
FROM taxi_zones AS TZ
WHERE TZ."Zone" = 'East Harlem North'
LIMIT 10
"""

pd.read_sql(query, engine)

|   | index | LocationID | Borough | Zone |
|---|---|---|---|---|
| 0 | 73 | 74 | Manhattan | East Harlem North |


Find the drop-off location with the highest tip:

In [22]:
query = f"""
SELECT GTD."DOLocationID"
FROM green_tripdata_2019_10 AS GTD
INNER JOIN taxi_zones AS TZ ON GTD."PULocationID" = TZ."LocationID"
WHERE TZ."Zone" = 'East Harlem North'
ORDER BY GTD."tip_amount" DESC
LIMIT 1
"""

pd.read_sql(query, engine)

|   | DOLocationID |
|---|---|
| 0 | 132 |


In [25]:
query = f"""
WITH top_tip_location_id (location_id) AS (
  SELECT GTD."DOLocationID"
  FROM green_tripdata_2019_10 AS GTD
  INNER JOIN taxi_zones AS TZ ON GTD."PULocationID" = TZ."LocationID"
  WHERE TZ."Zone" = 'East Harlem North'
  ORDER BY GTD."tip_amount" DESC
  LIMIT 1
)
SELECT TZ."Zone"
FROM taxi_zones AS TZ, top_tip_location_id AS TTLI
WHERE TZ."LocationID" = TTLI."location_id"
"""

print(f"Answer to question 6: {pd.read_sql(query, engine)}")

Answer to question 6:           Zone
0  JFK Airport
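
The CTE above can also be written as one query that joins the zone lookup twice, once for the pickup side and once for the drop-off side. Here is a minimal sketch on in-memory SQLite tables with made-up tips. The table names `trips` and `zones` and the aliases `pickup` and `dropoff` are assumptions for the example; in Postgres the mixed-case column names need double quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zones (LocationID INTEGER, Zone TEXT)")
conn.execute(
    "CREATE TABLE trips (PULocationID INTEGER, DOLocationID INTEGER, tip_amount REAL)"
)
conn.executemany(
    "INSERT INTO zones VALUES (?, ?)",
    [(74, "East Harlem North"), (7, "Astoria"), (8, "Astoria Park")],
)
# Made-up tips; the 50.0 tip is picked up elsewhere and must not count.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (74, 7, 5.0),
        (74, 8, 20.0),
        (7, 74, 50.0),
        (74, 74, 1.0),
    ],
)

query = """
SELECT dropoff.Zone, MAX(t.tip_amount) AS largest_tip
FROM trips AS t
JOIN zones AS pickup ON t.PULocationID = pickup.LocationID
JOIN zones AS dropoff ON t.DOLocationID = dropoff.LocationID
WHERE pickup.Zone = ?
GROUP BY dropoff.Zone
ORDER BY largest_tip DESC
LIMIT 1
"""

result = conn.execute(query, ("East Harlem North",)).fetchone()
print(result)
```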


## Terraform

In this section of the homework we'll prepare the environment by creating resources in GCP with Terraform.

Install Terraform on your GCP VM, laptop, or GitHub Codespace.
Copy the files from the course repo
[here](../../../01-docker-terraform/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP bucket and a BigQuery dataset.

## Question 7. Terraform Workflow

Which of the following sequences, **respectively**, describes the workflow for: 
1. Downloading the provider plugins and setting up the backend,
2. Generating proposed changes and auto-executing the plan,
3. Removing all resources managed by Terraform

Answers:
- terraform import, terraform apply -y, terraform destroy
- teraform init, terraform plan -auto-apply, terraform rm
- terraform init, terraform run -auto-approve, terraform destroy
- terraform init, terraform apply -auto-approve, terraform destroy
- terraform import, terraform apply -y, terraform rm

Answer: terraform init, terraform apply -auto-approve, terraform destroy