# Week 2 Homework

For the homework, we'll be working with the _green_ taxi dataset located here:

`https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green`

The Mage project folders are in PROJECT_NAME=`hmwk-02`; the following folders are pushed to GitHub.

```
.
├── data_exporters
├── data_loaders
├── pipelines
├── transformers
```

In [None]:
!tree . -L 1

### Assignment

The goal will be to construct an ETL pipeline that loads the data, performs some transformations, and writes the data to a database (and Google Cloud!).

- Create a new pipeline, call it `green_taxi_etl`
- Add a data loader block and use Pandas to read data for the final quarter of 2020 (months `10`, `11`, `12`).
  - You can use the same datatypes and date parsing methods shown in the course.
  - `BONUS`: load the final three months using a for loop and `pd.concat`
- Add a transformer block and perform the following:
  - Remove rows where the passenger count is equal to 0 _or_ the trip distance is equal to zero.
  - Create a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date.
  - Rename columns in Camel Case to Snake Case, e.g. `VendorID` to `vendor_id`.
  - Add three assertions:
    - `vendor_id` is one of the existing values in the column (currently)
    - `passenger_count` is greater than 0
    - `trip_distance` is greater than 0
- Using a Postgres data exporter (SQL or Python), write the dataset to a table called `green_taxi` in a schema `mage`. Replace the table if it already exists.
- Write your data as Parquet files to a bucket in GCP, partitioned by `lpep_pickup_date`. Use the `pyarrow` library!
- Schedule your pipeline to run daily at 5AM UTC.
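
The daily 5AM UTC schedule in the last bullet is usually written as the cron expression `0 5 * * *`. The sketch below only illustrates what that expression means; `next_daily_run` is a hypothetical helper (not part of Mage), and it assumes the trigger accepts a standard five-field cron string.

```python
from datetime import datetime, time, timedelta, timezone

# Cron fields: minute hour day-of-month month day-of-week
DAILY_5AM_UTC_CRON = "0 5 * * *"


def next_daily_run(now: datetime, hour: int = 5) -> datetime:
    """Return the first run at `hour`:00 UTC strictly after `now` (hypothetical helper)."""
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), time(hour, 0), tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's slot has already passed, so the next run is tomorrow
        candidate += timedelta(days=1)
    return candidate


before = next_daily_run(datetime(2024, 2, 1, 4, 30, tzinfo=timezone.utc))
after = next_daily_run(datetime(2024, 2, 1, 5, 0, tzinfo=timezone.utc))
print(before)
print(after)
```

The example dates are arbitrary; the point is that a run requested at 04:30 UTC fires the same day, while one requested at or after 05:00 UTC waits until the next day.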

Observe the pattern in the filenames.

https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2020-10.csv.gz
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2020-11.csv.gz
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2020-12.csv.gz
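
The three URLs differ only in the zero-padded month, so they can be generated rather than typed out. A minimal sketch, where `tripdata_url` is a hypothetical helper name introduced for illustration:

```python
BASE_URL = "https://github.com/DataTalksClub/nyc-tlc-data/releases/download"


def tripdata_url(service: str, year: int, month: int) -> str:
    """Build the download URL for one month of trip data (hypothetical helper)."""
    # {month:02d} zero-pads single-digit months to match the filename pattern
    filename = f"{service}_tripdata_{year}-{month:02d}.csv.gz"
    return f"{BASE_URL}/{service}/{filename}"


urls = [tripdata_url("green", 2020, m) for m in (10, 11, 12)]
for u in urls:
    print(u)
```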

In [1]:
import io
import pandas as pd
import requests

Initial test for manual ingestion.

In [2]:
taxi_dtypes = {
    'VendorID': pd.Int64Dtype(),
    'passenger_count': pd.Int64Dtype(),
    'trip_distance': float,
    'RatecodeID': pd.Int64Dtype(),
    'store_and_fwd_flag': str,
    'PULocationID': pd.Int64Dtype(),
    'DOLocationID': pd.Int64Dtype(),
    'payment_type': pd.Int64Dtype(),
    'fare_amount': float,
    'extra': float,
    'mta_tax': float,
    'tip_amount': float,
    'tolls_amount': float,
    'improvement_surcharge': float,
    'total_amount': float,
    'congestion_surcharge': float,
}

# native date parsing 
parse_dates = ['lpep_pickup_datetime', 'lpep_dropoff_datetime']

months = [10, 11, 12]
year = 2020
colour = 'green' # service
base_url="https://github.com/DataTalksClub/nyc-tlc-data/releases/download"

# Create empty list to store DataFrames
dataframes = []

In [7]:
# manually looping through 10, 11, 12
# oct_2020, nov_2020, dec_2020

url='https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2020-12.csv.gz'

dec_2020 = pd.read_csv(
            url
            , sep=','
            , compression='gzip'
            , dtype=taxi_dtypes
            , parse_dates=parse_dates
        ) 


In [8]:
dec_2020.columns

Index(['VendorID', 'lpep_pickup_datetime', 'lpep_dropoff_datetime',
       'store_and_fwd_flag', 'RatecodeID', 'PULocationID', 'DOLocationID',
       'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax',
       'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge',
       'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge'],
      dtype='object')

In [9]:
# Run above manually 3x before running this cell
print(dec_2020.shape)
print(nov_2020.shape)
print(oct_2020.shape)

(83130, 20)
(88605, 20)
(95120, 20)


In [10]:
dec_2020.shape[0] + nov_2020.shape[0] + oct_2020.shape[0]

266855

Using a for-loop for ingestion. The successful code is used in the [data_loaders load_api block](./hmwk-02/data_loaders/load_api_green_data.py).

In [11]:
# Iterate through months and download data
for month in months:
    print(month)

    filename = f"{colour}_tripdata_{year}-{month:02d}.csv.gz"
    print(filename)

    url = f"{base_url}/{colour}/{filename}"
    print(url)

    # Download each file once, then parse the bytes already in memory
    # (passing the URL to read_csv again would fetch the file a second time)
    response = requests.get(url)

    if response.status_code == 200:
        df = pd.read_csv(
            io.BytesIO(response.content)
            , sep=','
            , compression='gzip'
            , dtype=taxi_dtypes
            , parse_dates=parse_dates
        )

        # Append DataFrame to the list
        dataframes.append(df)
        print(f"Downloaded {filename} successfully!")

    else:
        print(f"Failed to download {filename}. Status code: {response.status_code}")


10
green_tripdata_2020-10.csv.gz
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2020-10.csv.gz
Downloaded green_tripdata_2020-10.csv.gz successfully!
11
green_tripdata_2020-11.csv.gz
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2020-11.csv.gz
Downloaded green_tripdata_2020-11.csv.gz successfully!
12
green_tripdata_2020-12.csv.gz
https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2020-12.csv.gz
Downloaded green_tripdata_2020-12.csv.gz successfully!


In [12]:
# Concatenate DataFrames
combined_df = pd.concat(dataframes, ignore_index=True)


In [14]:
combined_df.sample(10)

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
192532,2.0,2020-12-06 04:37:21,2020-12-06 05:07:23,N,1.0,82,145,2.0,4.02,20.0,0.5,0.5,0.0,0.0,,0.3,21.3,2.0,1.0,0.0
237938,,2020-12-07 07:59:00,2020-12-07 08:34:00,,,98,140,,16.11,39.19,0.0,0.0,2.75,6.12,,0.3,48.36,,,
124061,2.0,2020-11-18 12:02:06,2020-11-18 12:13:41,N,1.0,244,239,1.0,4.81,16.0,0.0,0.5,5.86,0.0,,0.3,25.41,1.0,1.0,2.75
7205,1.0,2020-10-05 11:16:56,2020-10-05 11:20:54,N,1.0,7,226,1.0,0.6,5.0,0.0,0.5,0.0,0.0,,0.3,5.8,2.0,1.0,0.0
4281,2.0,2020-10-03 12:38:50,2020-10-03 13:22:08,N,1.0,39,83,1.0,18.94,54.0,0.0,0.5,2.75,0.0,,0.3,57.55,1.0,1.0,0.0
103309,1.0,2020-11-05 20:39:39,2020-11-05 20:45:37,N,1.0,97,97,1.0,0.7,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,2.0,1.0,0.0
182544,,2020-11-30 08:33:00,2020-11-30 09:20:00,,,35,140,,15.18,66.83,0.0,0.0,2.75,6.12,,0.3,76.0,,,
106484,2.0,2020-11-07 18:27:51,2020-11-07 18:35:14,N,1.0,66,52,1.0,1.45,7.0,0.0,0.5,1.95,0.0,,0.3,9.75,1.0,1.0,0.0
59285,,2020-10-06 15:36:00,2020-10-06 15:54:00,,,17,62,,1.76,16.73,0.0,0.0,2.75,0.0,,0.3,19.78,,,
123745,1.0,2020-11-18 09:00:07,2020-11-18 09:26:40,N,1.0,129,255,1.0,5.1,19.5,0.0,0.5,4.05,0.0,,0.3,24.35,1.0,1.0,0.0


### Questions

## Question 1. Data Loading

### Answer 1: `266,855 rows x 20 columns`

Once the dataset is loaded, what's the shape of the data?

* 266,855 rows x 20 columns
* 544,898 rows x 18 columns
* 544,898 rows x 20 columns
* 133,744 rows x 20 columns

In [15]:
# Print the final DataFrame
print(combined_df.shape)

(266855, 20)


## Question 2. Data Transformation

### Answer 2: `139,370 rows`

After removing the rows where the passenger count is equal to 0 _or_ the trip distance is equal to zero, how many rows are left?

* 544,897 rows
* 266,855 rows
* 139,370 rows
* 266,856 rows

In [17]:
import re

def camel_to_snake(name):
    # Replace lowercase-uppercase transitions with underscores
    name = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', name)
    return name.lower()

# clean column names, make all lowercase and convert to snake_case
print(combined_df.shape)
combined_df.columns = combined_df.columns.map(camel_to_snake)

# create new column of date dtype for 'lpep_pickup_date' from 'lpep_pickup_datetime'
combined_df['lpep_pickup_date'] = combined_df['lpep_pickup_datetime'].dt.date

# drop records of rides with no passengers (missing counts are tallied as 0)
print(f"Rows without passengers: {combined_df['passenger_count'].fillna(0).isin([0]).sum()}")
combined_df = combined_df[combined_df['passenger_count'] > 0]

# drop records of rides with 0 trip_distance
print(f"Rows with 0 trip_distance: {combined_df['trip_distance'].fillna(0).isin([0]).sum()}")
combined_df = combined_df[combined_df['trip_distance'] > 0]

print(combined_df.shape)

(266855, 20)
Rows without passengers: 120123
Rows with 0 trip_distance: 7362
(139370, 21)
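
The two sequential filters can also be expressed as a single boolean mask: removing rows where either value is zero is the same as keeping rows where both are strictly positive. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset), which also shows how missing passenger counts end up dropped:

```python
import pandas as pd

# Toy data (made up for illustration); None mimics a missing passenger count
toy = pd.DataFrame({
    "passenger_count": pd.array([1, 0, None, 2, 1], dtype="Int64"),
    "trip_distance": [1.2, 3.4, 5.0, 0.0, 0.7],
})

# Keep rows where both values are strictly positive.
# Comparisons against a missing value yield NA, which fillna(False) turns into "drop".
keep = ((toy["passenger_count"] > 0) & (toy["trip_distance"] > 0)).fillna(False)
filtered = toy[keep]

print(len(toy), "->", len(filtered))
```

Only the first and last toy rows survive: the others have a zero count, a missing count, or a zero distance.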


## Question 3. Data Transformation

### Answer 3: `data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date`

Which of the following creates a new column `lpep_pickup_date` by converting `lpep_pickup_datetime` to a date?

* data = data['lpep_pickup_datetime'].date
* data('lpep_pickup_date') = data['lpep_pickup_datetime'].date
* data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt.date
* data['lpep_pickup_date'] = data['lpep_pickup_datetime'].dt().date()

## Question 4. Data Transformation

### Answer 4: `1 or 2`

What are the existing values of `VendorID` in the dataset?

* 1, 2, or 3
* 1 or 2
* 1, 2, 3, 4
* 1

In [18]:
combined_df.vendor_id.unique()

<IntegerArray>
[2, 1]
Length: 2, dtype: Int64

In [19]:
combined_df.vendor_id.value_counts()

vendor_id
2    117408
1     21962
Name: count, dtype: Int64
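
With the vendor values known, the three assertions from the assignment can be sketched as plain checks. `check_green_taxi` is a hypothetical helper name, and the toy frame holds made-up values; in a Mage transformer the same assertions would presumably go into the block's test functions instead.

```python
import pandas as pd


def check_green_taxi(df: pd.DataFrame, allowed_vendors=(1, 2)) -> None:
    """Raise AssertionError if any of the three homework conditions fails."""
    assert df["vendor_id"].isin(allowed_vendors).all(), "unexpected vendor_id value"
    # fillna(False) makes a missing passenger_count fail the check instead of being skipped
    assert (df["passenger_count"] > 0).fillna(False).all(), "passenger_count must be > 0"
    assert (df["trip_distance"] > 0).all(), "trip_distance must be > 0"


# Toy frame with made-up values, only to exercise the checks
toy = pd.DataFrame({
    "vendor_id": pd.array([2, 1, 2], dtype="Int64"),
    "passenger_count": pd.array([1, 2, 1], dtype="Int64"),
    "trip_distance": [0.6, 4.81, 18.94],
})
check_green_taxi(toy)
print("all checks passed")
```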

## Question 5. Data Transformation

### Answer 5: `4`

How many columns need to be renamed to snake case?

* 3
* 6
* 2
* 4

In [20]:
dec_2020.columns

Index(['VendorID', 'lpep_pickup_datetime', 'lpep_dropoff_datetime',
       'store_and_fwd_flag', 'RatecodeID', 'PULocationID', 'DOLocationID',
       'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax',
       'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge',
       'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge'],
      dtype='object')

In [21]:
camels = ['VendorID', 'RatecodeID', 'PULocationID', 'DOLocationID']
print(len(camels))

4
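
The same count can be derived instead of listed by hand: apply the renaming rule to every column and keep the names that change. This sketch repeats the `camel_to_snake` rule from the transformation cell so that it runs on its own.

```python
import re


def camel_to_snake(name: str) -> str:
    # Same rule as the transformation cell: underscore at each lowercase-to-uppercase boundary
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name).lower()


columns = [
    "VendorID", "lpep_pickup_datetime", "lpep_dropoff_datetime",
    "store_and_fwd_flag", "RatecodeID", "PULocationID", "DOLocationID",
    "passenger_count", "trip_distance", "fare_amount", "extra", "mta_tax",
    "tip_amount", "tolls_amount", "ehail_fee", "improvement_surcharge",
    "total_amount", "payment_type", "trip_type", "congestion_surcharge",
]

# Only columns whose name actually changes need renaming
renamed = {c: camel_to_snake(c) for c in columns if camel_to_snake(c) != c}
print(renamed)
print(len(renamed))
```

Note that this rule only splits where a lowercase letter meets an uppercase one, so `PULocationID` becomes `pulocation_id` rather than `pu_location_id`.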


## Question 6. Data Exporting

### Answer 6: `96`

Once exported, how many partitions (folders) are present in Google Cloud?

* 96
* 56
* 67
* 108


## Connect to BigQuery 

Note: `sqlalchemy-bigquery 1.9.0` is only compatible with SQLAlchemy versions < 2.0.0 ([source](https://pypi.org/project/sqlalchemy-bigquery/)). <br>
Probably not worth downgrading SQLAlchemy just for this?

[google-cloud-python](https://github.com/googleapis/google-cloud-python/tree/main)

[run-google-bigquery-sql-with-vscode](https://developers.lseg.com/en/article-catalog/article/run-google-bigquery-sql-with-vscode)

[pandas-gbq alternative to google.cloud](https://pandas-gbq.readthedocs.io/en/latest/)

In [None]:
# !pip show sqlalchemy

In [None]:
from google.cloud import bigquery
client = bigquery.Client(project='nyc-rides-ella')
import pandas as pd

### Yellow_cab_data

In [None]:
query = """
    SELECT 
        date(tpep_pickup_datetime) as pickup_date, 
        SUM(passenger_count) as total_passenger_count,
        MAX(trip_distance) as max_trip_distance,
        COUNT(*) as number_of_trips
    FROM nyc-rides-ella.ny_taxi.yellow_cab_data 
    GROUP BY 
        pickup_date
    ORDER BY number_of_trips DESC
    LIMIT 1000
"""

df_yellow = client.query(query).to_dataframe()
df_yellow.shape

# WORKS!
# rows = client.query_and_wait(query)  # Make an API request.

# print("The query data:")
# for row in rows:
#     # Row values can be accessed by field name or index.
#     print("date={}, number_of_trips={}".format(row[0], row["number_of_trips"]))

In [None]:
df_yellow

### green_taxi

In [None]:
from google.cloud import bigquery
client = bigquery.Client(project='nyc-rides-ella')
import pandas as pd

query = """
    SELECT 
        *
    FROM nyc-rides-ella.mage.green_taxi
    LIMIT 10
"""
df_green = client.query(query).to_dataframe()
df_green.shape


In [None]:
query = """
    SELECT 
        EXTRACT(YEAR FROM lpep_pickup_datetime) AS Year,
        EXTRACT(MONTH FROM lpep_pickup_datetime) AS Month,
        COUNT(*) as number_of_trips
    FROM nyc-rides-ella.mage.green_taxi
    GROUP BY Year, Month
    ORDER BY
        number_of_trips DESC
"""
df_green = client.query(query).to_dataframe()
df_green.shape


In [None]:
df_green

Weird outliers/wrong-data there (a trip from `2009`!). Need more cleanup before going forward.
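
One possible cleanup, assuming only pickups from the final quarter of 2020 are wanted, is to filter on the pickup timestamp before exporting. The toy frame below uses made-up timestamps purely to illustrate the range check.

```python
import pandas as pd

# Toy data (made up): stray pickups outside Q4 2020 mixed with valid ones
toy = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2009-01-01 00:12:00", "2020-10-05 11:16:56",
        "2020-12-31 23:59:00", "2021-01-01 00:03:00",
    ]),
})

start = pd.Timestamp("2020-10-01")
end = pd.Timestamp("2021-01-01")  # exclusive upper bound

# Keep pickups in the half-open interval [start, end)
in_range = (toy["lpep_pickup_datetime"] >= start) & (toy["lpep_pickup_datetime"] < end)
cleaned = toy[in_range]

print(len(toy), "->", len(cleaned))
```

Applying such a filter would change the row and partition counts, so the homework answers above are based on the unfiltered data.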


## Verify all answers from direct BigQuery queries

### Qn 1

This is only visible from the pipeline run of the `load_data_from_api` data loader block.

### Qn 2

In [None]:
query = """
    SELECT  *
    FROM nyc-rides-ella.mage.green_taxi
"""
df_green = client.query(query).to_dataframe()
df_green.shape

None of the other questions rely on querying the database with SQL, after all.