# Introduction to Taxi ETL Job
This is the Taxi ETL job to generate the input datasets for the Taxi XGBoost job.

## Prerequisites
### 1. Download data
All data can be found at https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

### 2. Download needed jars
* [rapids-4-spark_2.12-23.10.0.jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.10.0/rapids-4-spark_2.12-23.10.0.jar)

### 3. Start Spark Standalone
Before running the script, set up Spark in standalone mode.

### 4. Set environment variables
```
$ export SPARK_JARS=rapids-4-spark_2.12-23.10.0.jar
$ export PYSPARK_DRIVER_PYTHON=jupyter 
$ export PYSPARK_DRIVER_PYTHON_OPTS=notebook
```

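### 5. Start Jupyter Notebook with plugin config

The command uses `${SPARK_MASTER}` and `${SPARK_PY_FILES}`, which step 4 does not set. Export them first: `SPARK_MASTER` is the master URL of your standalone cluster, and `SPARK_PY_FILES` is the list of Python files passed to `--py-files`.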

```
$ pyspark --master ${SPARK_MASTER} \
    --jars ${SPARK_JARS} \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.rapids.sql.incompatibleDateFormats.enabled=true \
    --conf spark.rapids.sql.csv.read.double.enabled=true \
    --py-files ${SPARK_PY_FILES}

## Import Libraries

In [1]:
import time
import os
import math
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

## Script Settings

### File Path Settings
* Define input/output file paths

In [2]:
# Update these to your real paths. You can download the dataset
# from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
# or unzip datasets/taxi-small.tar.gz and use the provided
# sample dataset datasets/taxi/taxi-etl-input-small.csv
dataRoot = os.getenv('DATA_ROOT', '/data')
rawPath = dataRoot + '/taxi/taxi-etl-input-small.csv'
outPath = dataRoot + '/taxi/output'
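
A minimal sketch of how the `DATA_ROOT` environment variable drives both paths. The helper name `resolve_paths` is hypothetical and not part of the notebook, and the existence check only makes sense when the data lives on a local filesystem (it would not work for HDFS or object storage).

```python
import os

def resolve_paths(default_root='/data'):
    """Hypothetical helper: build the input and output paths from DATA_ROOT."""
    data_root = os.getenv('DATA_ROOT', default_root)
    raw_path = data_root + '/taxi/taxi-etl-input-small.csv'
    out_path = data_root + '/taxi/output'
    return raw_path, out_path

raw_path, out_path = resolve_paths()
# Local-filesystem check only; an assumption that the data is not on HDFS/S3.
print(raw_path, 'exists' if os.path.exists(raw_path) else 'is missing')
print(out_path)
```

Running the notebook with `DATA_ROOT=/my/data` set in the environment would then read from `/my/data/taxi/taxi-etl-input-small.csv` and write under `/my/data/taxi/output`.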

## Define Functions and Objects
### Define the constants

* Define input file schema

In [3]:
raw_schema = StructType([
    StructField('vendor_id', StringType()),
    StructField('pickup_datetime', StringType()),
    StructField('dropoff_datetime', StringType()),
    StructField('passenger_count', IntegerType()),
    StructField('trip_distance', DoubleType()),
    StructField('pickup_longitude', DoubleType()),
    StructField('pickup_latitude', DoubleType()),
    StructField('rate_code', StringType()),
    StructField('store_and_fwd_flag', StringType()),
    StructField('dropoff_longitude', DoubleType()),
    StructField('dropoff_latitude', DoubleType()),
    StructField('payment_type', StringType()),
    StructField('fare_amount', DoubleType()),
    StructField('surcharge', DoubleType()),
    StructField('mta_tax', DoubleType()),
    StructField('tip_amount', DoubleType()),
    StructField('tolls_amount', DoubleType()),
    StructField('total_amount', DoubleType()),
])

* Define some ETL functions

In [4]:
def drop_useless(data_frame):
    return data_frame.drop(
        'dropoff_datetime',
        'payment_type',
        'surcharge',
        'mta_tax',
        'tip_amount',
        'tolls_amount',
        'total_amount')

In [5]:
def encode_categories(data_frame):
    # hash here is pyspark.sql.functions.hash (from the star import), not Python's built-in.
    categories = ['vendor_id', 'rate_code', 'store_and_fwd_flag']
    for category in categories:
        data_frame = data_frame.withColumn(category, hash(col(category)))
    return data_frame.withColumnRenamed("store_and_fwd_flag", "store_and_fwd")

In [6]:
def fill_na(data_frame):
    return data_frame.fillna(-1)

In [7]:
def remove_invalid(data_frame):
    # Each entry is (column, lower bound, upper bound); both bounds are exclusive.
    conditions = [
        ( 'fare_amount', 0, 500 ),
        ( 'passenger_count', 0, 6 ),
        ( 'pickup_longitude', -75, -73 ),
        ( 'dropoff_longitude', -75, -73 ),
        ( 'pickup_latitude', 40, 42 ),
        ( 'dropoff_latitude', 40, 42 ),
    ]
    # Avoid naming the loop variables min/max, which would shadow the built-ins.
    for column, lower, upper in conditions:
        data_frame = data_frame.filter('{0} > {1} and {0} < {2}'.format(column, lower, upper))
    return data_frame
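
To make the filter strings concrete, here is a small plain-Python sketch that builds the same expressions without Spark. The helper name `build_range_filters` is hypothetical; it only reproduces the string formatting used in `remove_invalid`.

```python
def build_range_filters(conditions):
    """Return one SQL filter string per (column, lower, upper) triple."""
    return ['{0} > {1} and {0} < {2}'.format(column, lower, upper)
            for column, lower, upper in conditions]

filters = build_range_filters([
    ('fare_amount', 0, 500),
    ('pickup_longitude', -75, -73),
])
for expression in filters:
    print(expression)
# fare_amount > 0 and fare_amount < 500
# pickup_longitude > -75 and pickup_longitude < -73
```

Both comparisons are strict, so boundary values are dropped. Because `fill_na` runs before `remove_invalid` in `pre_process`, rows whose filtered columns were null (now `-1`) fail these checks and are removed as well.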

In [8]:
def convert_datetime(data_frame):
    datetime = col('pickup_datetime')
    return (data_frame
        .withColumn('pickup_datetime', to_timestamp(datetime))
        .withColumn('year', year(datetime))
        .withColumn('month', month(datetime))
        .withColumn('day', dayofmonth(datetime))
        .withColumn('day_of_week', dayofweek(datetime))
        .withColumn(
            'is_weekend',
            col('day_of_week').isin(1, 7).cast(IntegerType()))  # 1: Sunday, 7: Saturday
        .withColumn('hour', hour(datetime))
        .drop('pickup_datetime'))
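
Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from Python's conventions. The sketch below reproduces that numbering and the `is_weekend` flag with the standard library only; the function names are hypothetical and exist just to illustrate the convention.

```python
from datetime import datetime

def spark_style_dayofweek(ts):
    # isoweekday(): Monday=1 ... Sunday=7
    # Spark dayofweek: Sunday=1 ... Saturday=7
    return ts.isoweekday() % 7 + 1

def is_weekend(ts):
    # Same rule as the notebook: day_of_week in (1, 7), cast to an integer flag.
    return int(spark_style_dayofweek(ts) in (1, 7))

saturday = datetime(2024, 1, 6, 14, 30)
monday = datetime(2024, 1, 8, 9, 0)
print(spark_style_dayofweek(saturday), is_weekend(saturday))  # 7 1
print(spark_style_dayofweek(monday), is_weekend(monday))      # 2 0
```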

In [9]:
def add_h_distance(data_frame):
    # Haversine (great-circle) distance between pickup and dropoff, in km.
    p = math.pi / 180  # degrees to radians
    lat1 = col('pickup_latitude')
    lon1 = col('pickup_longitude')
    lat2 = col('dropoff_latitude')
    lon2 = col('dropoff_longitude')
    internal_value = (0.5
        - cos((lat2 - lat1) * p) / 2
        + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    # 12734 is the Earth diameter in km used by this notebook.
    h_distance = 12734 * asin(sqrt(internal_value))
    return data_frame.withColumn('h_distance', h_distance)
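
The same haversine formula can be checked on single coordinate pairs in plain Python, which is handy for sanity-testing the column expression. This is a sketch: the scalar function `h_distance` and its `diameter_km` parameter are hypothetical, and the default of 12734 simply mirrors the constant in the notebook (the mean Earth diameter is closer to 12742 km, so results differ by well under 0.1%).

```python
import math

def h_distance(lat1, lon1, lat2, lon2, diameter_km=12734):
    """Scalar haversine distance in km, using the same terms as add_h_distance."""
    p = math.pi / 180  # degrees to radians
    internal_value = (0.5
        - math.cos((lat2 - lat1) * p) / 2
        + math.cos(lat1 * p) * math.cos(lat2 * p)
        * (1 - math.cos((lon2 - lon1) * p)) / 2)
    return diameter_km * math.asin(math.sqrt(internal_value))

# One degree of latitude at constant longitude is roughly 111 km.
print(round(h_distance(40.0, -74.0, 41.0, -74.0), 1))
# Identical pickup and dropoff points give a distance of zero.
print(h_distance(40.7, -74.0, 40.7, -74.0))
```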

* Define main ETL function

In [10]:
def pre_process(data_frame):
    processes = [
        drop_useless,
        encode_categories,
        fill_na,
        remove_invalid,
        convert_datetime,
        add_h_distance,
    ]
    for process in processes:
        data_frame = process(data_frame)
    return data_frame
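
The loop in `pre_process` is a left-to-right function pipeline: each step receives the output of the previous one, so the order of the list matters. A minimal sketch of the same pattern on plain lists, with hypothetical toy steps standing in for the DataFrame functions:

```python
from functools import reduce

def apply_pipeline(value, steps):
    """Feed value through each step in order, like the loop in pre_process."""
    return reduce(lambda acc, step: step(acc), steps, value)

# Toy stand-ins (not the notebook's functions) that mimic fill_na followed
# by a validity filter.
def fill_missing(rows):
    return [-1 if row is None else row for row in rows]

def drop_negative(rows):
    return [row for row in rows if row >= 0]

print(apply_pipeline([3, None, 2], [fill_missing, drop_negative]))  # [3, 2]
```

As in the notebook, filling missing values first means the later filter sees `-1` instead of a null and can discard the row with an ordinary comparison.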

## Run ETL Process and Save the Result
* Create Spark Session and create dataframe

In [11]:
spark = (SparkSession
    .builder
    .appName("Taxi-ETL")
    .getOrCreate())
# Read the CSV with the explicit schema; the input file has a header row.
raw_data = (spark
    .read
    .format('csv')
    .schema(raw_schema)
    .option('header', 'True')
    .load(rawPath))

* Run ETL Process and Save the Result

In [12]:
start = time.time()
# Weights are 80/20/0: train, eval, and an (empty) trans split.
etled_train, etled_eval, etled_trans = pre_process(raw_data).randomSplit([80.0, 20.0, 0.0])
etled_train.write.mode("overwrite").parquet(outPath+'/train')
etled_eval.write.mode("overwrite").parquet(outPath+'/eval')
etled_trans.write.mode("overwrite").parquet(outPath+'/trans')
end = time.time()
print(end - start)
spark.stop()

5.114504098892212
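
`randomSplit` normalizes its weights when they do not sum to 1, so `(80, 20, 0)` behaves like fractions of the whole dataset. The sketch below shows that normalization in plain Python; `normalize_weights` is a hypothetical helper for illustration, not a Spark API.

```python
def normalize_weights(weights):
    """Scale weights so they sum to 1, as randomSplit does internally."""
    total = float(sum(weights))
    return [weight / total for weight in weights]

print(normalize_weights([80, 20, 0]))  # [0.8, 0.2, 0.0]
```

With a weight of 0, the `trans` output is expected to contain no rows. The split is also random on each run; `randomSplit` accepts an optional `seed` argument if reproducible train and eval sets are needed.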
