## Question 1. Understanding docker first run 

Run Docker with the `python:3.12.8` image in interactive mode, using the entrypoint `bash`.

What's the version of `pip` in the image?

#### Commands: 

- ```docker run -it --entrypoint=bash python:3.12.8```
- ```pip --version``` (or ```pip list```, which also lists the version of `pip` itself)

##### Answer: 24.3.1

## Question 2. Understanding Docker networking and docker-compose

Given the following `docker-compose.yaml`, what is the `hostname` and `port` that **pgadmin** should use to connect to the postgres database?

```yaml
services:
  db:
    container_name: postgres
    image: postgres:17-alpine
    environment:
      POSTGRES_USER: 'postgres'
      POSTGRES_PASSWORD: 'postgres'
      POSTGRES_DB: 'ny_taxi'
    ports:
      - '5433:5432'
    volumes:
      - vol-pgdata:/var/lib/postgresql/data

  pgadmin:
    container_name: pgadmin
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: "pgadmin@pgadmin.com"
      PGADMIN_DEFAULT_PASSWORD: "pgadmin"
    ports:
      - "8080:80"
    volumes:
      - vol-pgadmin_data:/var/lib/pgadmin  

volumes:
  vol-pgdata:
    name: vol-pgdata
  vol-pgadmin_data:
    name: vol-pgadmin_data
```

#### Answer and Explanation:
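- Inside the Compose network, `pgadmin` can reach the database by either the service name `db` or the container name `postgres`. Both resolve to the Postgres container defined in the `docker-compose.yaml` above.

- Containers talk to each other on the container port, `5432`. The host port `5433` from the `5433:5432` mapping only applies to connections made from the host machine.

- Therefore, both `db:5432` and `postgres:5432` will connect `pgadmin` to the Postgres database container.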
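The distinction between the two ports can be sketched in a few lines of Python. This is only an illustration: the `Endpoint` class and the `postgres_endpoints` helper are hypothetical names, and the values are copied from the compose file above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


# Values copied from the compose file above.
SERVICE_NAME = "db"            # key under `services:`
CONTAINER_NAME = "postgres"    # `container_name:` of that service
HOST_PORT, CONTAINER_PORT = 5433, 5432  # from the '5433:5432' mapping


def postgres_endpoints(from_inside_network: bool) -> list[Endpoint]:
    """Return the host/port pairs a client can use to reach Postgres."""
    if from_inside_network:
        # Other containers (such as pgadmin) use the service or container
        # name together with the container port.
        return [
            Endpoint(SERVICE_NAME, CONTAINER_PORT),
            Endpoint(CONTAINER_NAME, CONTAINER_PORT),
        ]
    # A client on the host machine goes through the published host port.
    return [Endpoint("localhost", HOST_PORT)]


for endpoint in postgres_endpoints(from_inside_network=True):
    print(f"{endpoint.host}:{endpoint.port}")
```

This is also why the notebook cells below connect to `localhost:5433`: they run on the host, not inside the Compose network.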

In [1]:
import pandas as pd

```bash
pip install sqlalchemy psycopg2-binary 
```

In [2]:
from sqlalchemy import create_engine

In [3]:
engine = create_engine('postgresql://root:root@localhost:5433/ny_taxi')

In [4]:
engine.connect()

<sqlalchemy.engine.base.Connection at 0x7fa5eebb16d0>

In [5]:
query = """
SELECT 1 as number;
"""

pd.read_sql(query, con=engine)

Unnamed: 0,number
0,1


In [6]:
query = """
SELECT *
FROM pg_catalog.pg_tables
WHERE schemaname != 'pg_catalog' AND 
    schemaname != 'information_schema';
"""

pd.read_sql(query, con=engine)

Unnamed: 0,schemaname,tablename,tableowner,tablespace,hasindexes,hasrules,hastriggers,rowsecurity
0,public,yellow_taxi_trips,root,,True,False,False,False
1,public,taxi_zones,root,,True,False,False,False
2,public,green_tripdata_trip,root,,False,False,False,False


## Question 3. Trip Segmentation Count

During the period of October 1st 2019 (inclusive) and November 1st 2019 (exclusive), how many trips, **respectively**, happened:
1. Up to 1 mile
2. In between 1 (exclusive) and 3 miles (inclusive),
3. In between 3 (exclusive) and 7 miles (inclusive),
4. In between 7 (exclusive) and 10 miles (inclusive),
5. Over 10 miles

In [7]:
# Reading the whole dataset from the CSV file
df = pd.read_csv('green_tripdata_2019-10.csv', low_memory=False)

In [8]:
# Converting the pickup and dropoff columns to datetime
df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)

In [9]:
df.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2.0,2019-10-01 00:26:02,2019-10-01 00:39:58,N,1.0,112,196,1.0,5.88,18.0,0.5,0.5,0.0,0.0,,0.3,19.3,2.0,1.0,0.0
1,1.0,2019-10-01 00:18:11,2019-10-01 00:22:38,N,1.0,43,263,1.0,0.8,5.0,3.25,0.5,0.0,0.0,,0.3,9.05,2.0,1.0,0.0
2,1.0,2019-10-01 00:09:31,2019-10-01 00:24:47,N,1.0,255,228,2.0,7.5,21.5,0.5,0.5,0.0,0.0,,0.3,22.8,2.0,1.0,0.0
3,1.0,2019-10-01 00:37:40,2019-10-01 00:41:49,N,1.0,181,181,1.0,0.9,5.5,0.5,0.5,0.0,0.0,,0.3,6.8,2.0,1.0,0.0
4,2.0,2019-10-01 00:08:13,2019-10-01 00:17:56,N,1.0,97,188,1.0,2.52,10.0,0.5,0.5,2.26,0.0,,0.3,13.56,1.0,1.0,0.0


In [14]:
df.to_sql(name='green_tripdata_trip', con=engine, index=False)

In [11]:
#Describing the table columns
green_tx_clms = """
SELECT column_name, data_type, is_nullable 
FROM information_schema.columns 
WHERE table_name = 'green_tripdata_trip' AND table_schema = 'public';
"""
pd.read_sql(green_tx_clms, con=engine)

Unnamed: 0,column_name,data_type,is_nullable
0,VendorID,double precision,YES
1,lpep_pickup_datetime,timestamp without time zone,YES
2,lpep_dropoff_datetime,timestamp without time zone,YES
3,store_and_fwd_flag,text,YES
4,RatecodeID,double precision,YES
5,PULocationID,bigint,YES
6,DOLocationID,bigint,YES
7,passenger_count,double precision,YES
8,trip_distance,double precision,YES
9,fare_amount,double precision,YES


In [12]:
trip_segmentation = """
-- Explicit bounds avoid gaps such as 1 < distance < 1.01 that BETWEEN 1.01 AND 3 would skip
SELECT SUM(
    CASE WHEN trip_distance <= 1 THEN 1 ELSE 0 END
    ) AS "Up_to_1_Mile",
    SUM(
    CASE WHEN trip_distance > 1 AND trip_distance <= 3 THEN 1 ELSE 0 END
    ) AS "2_To_3_Miles",
    SUM(
    CASE WHEN trip_distance > 3 AND trip_distance <= 7 THEN 1 ELSE 0 END
    ) AS "4_To_7_Miles",
    SUM(
    CASE WHEN trip_distance > 7 AND trip_distance <= 10 THEN 1 ELSE 0 END
    ) AS "8_To_10_Miles",
    SUM(
    CASE WHEN trip_distance > 10 THEN 1 ELSE 0 END
    ) AS "Over_10_Miles"
FROM public.green_tripdata_trip
-- October 1st (inclusive) to November 1st (exclusive), for both pickup and dropoff
WHERE lpep_pickup_datetime >= '2019-10-01' AND lpep_pickup_datetime < '2019-11-01'
    AND lpep_dropoff_datetime >= '2019-10-01' AND lpep_dropoff_datetime < '2019-11-01'
"""

pd.read_sql(trip_segmentation, con=engine)

Unnamed: 0,Up_to_1_Mile,2_To_3_Miles,4_To_7_Miles,8_To_10_Miles,Over_10_Miles
0,104802,198924,109603,27678,35189
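The same segmentation can also be cross-checked in pandas without going through the database. The sketch below is a minimal illustration, not the method used above: `segment_trips` and the bucket labels are hypothetical names, and the sample rows are made up. In the notebook, the full `df` would be passed instead.

```python
import pandas as pd

# Made-up sample rows for illustration only.
pickups = pd.to_datetime(["2019-10-01 00:05:00"] * 8 + ["2019-11-01 00:00:00"])
sample = pd.DataFrame({
    "lpep_pickup_datetime": pickups,
    "lpep_dropoff_datetime": pickups + pd.Timedelta(minutes=10),
    "trip_distance": [0.5, 1.0, 1.005, 3.0, 6.2, 7.0, 9.9, 12.4, 2.0],
})

LABELS = ["up_to_1", "1_to_3", "3_to_7", "7_to_10", "over_10"]


def segment_trips(trips: pd.DataFrame) -> dict:
    """Count trips per distance bucket for October 2019."""
    start, end = pd.Timestamp("2019-10-01"), pd.Timestamp("2019-11-01")
    # Half-open window: start inclusive, end exclusive.
    in_window = (
        trips["lpep_pickup_datetime"].between(start, end, inclusive="left")
        & trips["lpep_dropoff_datetime"].between(start, end, inclusive="left")
    )
    # right=True makes every bin (low, high], matching the question's wording.
    buckets = pd.cut(
        trips.loc[in_window, "trip_distance"],
        bins=[float("-inf"), 1, 3, 7, 10, float("inf")],
        labels=LABELS,
        right=True,
    )
    counts = buckets.value_counts()
    return {label: int(counts.get(label, 0)) for label in LABELS}


print(segment_trips(sample))
```

Using `pd.cut` with right-closed bins keeps the boundaries in one place, so a value such as 1.005 cannot fall between two buckets.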


## Question 4. Longest trip for each day

Which was the pick up day with the longest trip distance?
Use the pick up time for your calculations.

Tip: For every day, we only care about one single trip with the longest distance. 

In [13]:
# Getting the longest trip each day
longest_trip_per_day = """
SELECT lpep_pickup_datetime::date as day, MAX(trip_distance) longest_trip_distance 
FROM public.green_tripdata_trip
GROUP BY lpep_pickup_datetime::date
ORDER BY longest_trip_distance desc
"""
pd.read_sql(longest_trip_per_day, con=engine)

Unnamed: 0,day,longest_trip_distance
0,2019-10-31,515.89
1,2019-10-11,95.78
2,2019-10-26,91.56
3,2019-10-24,90.75
4,2019-10-05,85.23
5,2019-10-21,71.5
6,2019-10-14,70.03
7,2019-10-29,66.98
8,2019-10-22,65.98
9,2019-10-17,59.74
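A pandas version of the same grouping is sketched below, assuming the pickup column has already been converted to datetime. The function name `longest_trip_by_day` is hypothetical; three of the sample distances are taken from the result above, and the 3.2-mile row is made up to show that only the daily maximum counts.

```python
import pandas as pd

sample = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2019-10-11 08:00:00",
        "2019-10-11 21:30:00",  # made-up shorter trip on the same day
        "2019-10-24 10:15:00",
        "2019-10-31 23:40:00",
    ]),
    "trip_distance": [95.78, 3.2, 90.75, 515.89],
})


def longest_trip_by_day(trips: pd.DataFrame) -> pd.Series:
    """Return the longest trip distance per pickup day, largest first."""
    pickup_day = trips["lpep_pickup_datetime"].dt.date
    per_day = trips.groupby(pickup_day)["trip_distance"].max()
    return per_day.sort_values(ascending=False)


ranking = longest_trip_by_day(sample)
print(ranking.index[0], ranking.iloc[0])
```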


## Question 5. Three biggest pickup zones

Which were the top pickup locations with over 13,000 in
`total_amount` (across all trips) for 2019-10-18?

Consider only `lpep_pickup_datetime` when filtering by date.

In [25]:
#Describing the table columns
pick_up_zones = """
SELECT column_name, data_type, is_nullable 
FROM information_schema.columns 
WHERE table_name = 'taxi_zones' AND table_schema = 'public';
"""
pd.read_sql(pick_up_zones, con=engine)

Unnamed: 0,column_name,data_type,is_nullable
0,index,bigint,YES
1,LocationID,bigint,YES
2,Borough,text,YES
3,Zone,text,YES
4,service_zone,text,YES


In [32]:
# Top pickup locations with more than 13,000 in total_amount on 2019-10-18
top_pick_up_zones = """
WITH top_pick_up_zones AS (
    SELECT "PULocationID" AS LocationID, SUM(total_amount) AS total_amounts
    FROM public.green_tripdata_trip
    WHERE lpep_pickup_datetime::date = '2019-10-18'
    GROUP BY "PULocationID"
    HAVING SUM(total_amount) > 13000
)
SELECT "Zone" AS Zone, total_amounts
FROM public.taxi_zones
INNER JOIN top_pick_up_zones
    ON public.taxi_zones."LocationID" = top_pick_up_zones.LocationID
ORDER BY total_amounts DESC
"""
pd.read_sql(top_pick_up_zones, con=engine)

Unnamed: 0,zone,total_amounts
0,East Harlem North,18686.68
1,East Harlem South,16797.26
2,Morningside Heights,13029.79
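The filter, aggregate, threshold and join steps can be sketched in pandas as follows. Everything here is illustrative: `top_pickup_zones` is a hypothetical helper, and the zone names, IDs and amounts are made up. The threshold of 13,000 is the one stated in the question.

```python
import pandas as pd

# Made-up data for illustration only.
trips = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2019-10-18 07:00:00",
        "2019-10-18 09:30:00",
        "2019-10-18 12:00:00",
        "2019-10-19 08:00:00",  # different day, must be ignored
        "2019-10-18 20:00:00",
    ]),
    "PULocationID": [1, 1, 2, 2, 3],
    "total_amount": [9000.0, 5000.0, 13500.0, 100.0, 12000.0],
})
zones = pd.DataFrame({
    "LocationID": [1, 2, 3],
    "Zone": ["Zone A", "Zone B", "Zone C"],
})


def top_pickup_zones(trips, zones, day, threshold):
    """Zones whose summed total_amount on `day` exceeds `threshold`."""
    # Filter on the pickup date only, as the question requires.
    on_day = trips[trips["lpep_pickup_datetime"].dt.normalize() == pd.Timestamp(day)]
    totals = on_day.groupby("PULocationID", as_index=False)["total_amount"].sum()
    # Equivalent of the SQL HAVING clause.
    totals = totals[totals["total_amount"] > threshold]
    named = totals.merge(zones, left_on="PULocationID", right_on="LocationID")
    ordered = named.sort_values("total_amount", ascending=False)
    return ordered[["Zone", "total_amount"]].reset_index(drop=True)


result = top_pickup_zones(trips, zones, "2019-10-18", 13_000)
print(result)
```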


## Question 6. Largest tip

For the passengers picked up in October 2019 in the zone
name "East Harlem North" which was the drop off zone that had
the largest tip?

Note: it's `tip` , not `trip`

We need the name of the zone, not the ID.

In [37]:
# Largest tip amount for pickups in "East Harlem North" (amount only; naming the drop-off zone needs a join on "DOLocationID")
largest_tip = """
SELECT "PULocationID", "Zone", MAX(tip_amount)
FROM public.taxi_zones AS tz
INNER JOIN public.green_tripdata_trip AS gt
ON tz."LocationID" = gt."PULocationID"
WHERE tz."Zone" = 'East Harlem North'
AND lpep_dropoff_datetime::date BETWEEN '2019-10-01' AND '2019-10-31'
GROUP BY "PULocationID", "Zone"
"""

pd.read_sql(largest_tip, con=engine)

Unnamed: 0,PULocationID,Zone,max
0,74,East Harlem North,87.3
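The query above reports the largest tip for the pickup zone, but the question asks for the name of the drop-off zone of that trip, with October determined by the pickup time. One way to get there is sketched below in pandas. The helper `largest_tip_dropoff`, the "Hypothetical Zone" names, their IDs and all tip values are assumptions for illustration; only the ID 74 for "East Harlem North" comes from the output above.

```python
import pandas as pd

zones = pd.DataFrame({
    "LocationID": [74, 1, 2],
    "Zone": ["East Harlem North", "Hypothetical Zone A", "Hypothetical Zone B"],
})
# Made-up trips for illustration only.
trips = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2019-10-03 09:00:00",
        "2019-10-20 18:30:00",
        "2019-10-25 12:00:00",  # picked up in another zone
        "2019-11-02 08:00:00",  # picked up outside October
    ]),
    "PULocationID": [74, 74, 1, 74],
    "DOLocationID": [1, 2, 2, 1],
    "tip_amount": [5.0, 40.0, 99.0, 80.0],
})


def largest_tip_dropoff(trips, zones, pickup_zone):
    """Return (drop-off zone name, tip) for the largest October tip."""
    zone_names = zones.set_index("LocationID")["Zone"]
    picked_up_here = trips["PULocationID"].map(zone_names) == pickup_zone
    in_october = trips["lpep_pickup_datetime"].between(
        pd.Timestamp("2019-10-01"), pd.Timestamp("2019-11-01"), inclusive="left"
    )
    candidates = trips[picked_up_here & in_october]
    best = candidates.loc[candidates["tip_amount"].idxmax()]
    # Translate the drop-off ID into a zone name, as the question asks.
    return zone_names.loc[best["DOLocationID"]], float(best["tip_amount"])


print(largest_tip_dropoff(trips, zones, "East Harlem North"))
```

In SQL, the equivalent would join `taxi_zones` twice, once on `"PULocationID"` to filter the pickup zone and once on `"DOLocationID"` to name the drop-off zone, then order by `tip_amount` descending and keep the first row.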


## Terraform

In this section of the homework we'll prepare the environment by creating resources in GCP with Terraform.

In your VM on GCP/Laptop/GitHub Codespace install Terraform. 
Copy the files from the course repo
[here](../../../01-docker-terraform/1_terraform_gcp/terraform) to your VM/Laptop/GitHub Codespace.

Modify the files as necessary to create a GCP Bucket and Big Query Dataset.


## Question 7. Terraform Workflow

Which of the following sequences, **respectively**, describes the workflow for: 
1. Downloading the provider plugins and setting up the backend,
2. Generating proposed changes and auto-executing the plan,
3. Removing all resources managed by Terraform