# NYC Taxi Trips Example

This data is freely available. You can find some interesting background information at https://chriswhong.com/open-data/foil_nyc_taxi/ . We will use this data to perform some analytical tasks. The whole workshop is split up into multiple sections, which represent the typical data processing flow in a data-centric project. We will follow these (simplified) steps for working with a data lake:

1. Build "Structured Zone" containing all sources
2. Build "Refined Zone" that contains pre-processed data
3. Analyze the data before working on the next steps to find an appropriate approach
4. Build "Integrated Zone" that contains integrated data
5. Use Machine Learning for business questions

## Requirements

The workshop will require the following Python packages:

* PySpark (tested with Spark 2.4)
* Matplotlib
* Pandas
* GeoPandas
* Cartopy
* Contextily

# Part 1 - Build Structured Zone

The first part is about building the structured zone. It will contain a copy of the raw data stored in Hive tables and thereby easily accessible for downstream processing.

In [None]:
taxi_basedir = "s3://dimajix-training/data/nyc-taxi-trips/"
weather_basedir = "s3://dimajix-training/data/weather/"
holidays_basedir = "s3://dimajix-training/data/bank-holidays/"

# 0 Create Spark Session

Before we begin, we create a Spark session if none was provided in the notebook.

In [None]:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession


if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","64G") \
        .getOrCreate()

spark

# 1 Taxi Data

In the first step we read in the raw data. The data is split into two different entities: basic trip information and payment information. We will store the data in a more efficient representation (Parquet) to form the structured zone.

## 1.1 Trip Information

We start by reading in the trip information. It contains the following columns:
* **medallion** - This is some sort of license for a taxi company. A single medallion is attached to a single cab and may be used by multiple drivers.
* **hack_license** - This is the driver's license
* **vendor_id**
* **rate_code** The final rate code in effect at the end of the trip. 
  * 1=Standard rate
  * 2=JFK
  * 3=Newark
  * 4=Nassau or Westchester
  * 5=Negotiated fare
  * 6=Group ride
* **store_and_fwd_flag** This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server
* **pickup_datetime** This is the time when a passenger was picked up
* **dropoff_datetime** This is the time when the passenger was dropped off again
* **passenger_count** Number of passengers of this trip
* **trip_time_in_secs**
* **trip_distance**
* **pickup_longitude**
* **pickup_latitude**
* **dropoff_longitude**
* **dropoff_latitude**

The primary key uniquely identifying each trip is given by the columns `medallion`, `hack_license`, `vendor_id` and `pickup_datetime`.
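
The key claim above can be sanity-checked on a small sample. The following is a minimal sketch in plain Python so that it runs without a Spark cluster; the helper `find_duplicate_keys` and the sample rows are hypothetical and not part of the workshop data. On a PySpark DataFrame, the same idea would be a `groupBy` on the key columns followed by a `count` and a filter on counts greater than one.

```python
from collections import Counter

# Columns that are assumed to form the primary key of a trip
KEY_COLUMNS = ("medallion", "hack_license", "vendor_id", "pickup_datetime")


def find_duplicate_keys(rows, key_columns=KEY_COLUMNS):
    """Return a dict mapping each key that occurs more than once to its count."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample rows (not taken from the real data set)
sample_rows = [
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "VTS",
     "pickup_datetime": "2013-01-01 10:00:00"},
    {"medallion": "M1", "hack_license": "H1", "vendor_id": "VTS",
     "pickup_datetime": "2013-01-01 10:00:00"},
    {"medallion": "M2", "hack_license": "H2", "vendor_id": "CMT",
     "pickup_datetime": "2013-01-01 10:05:00"},
]

duplicates = find_duplicate_keys(sample_rows)
print(duplicates)
```

An empty result means that the key columns identify every row of the sample uniquely.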

In [None]:
from pyspark.sql.types import *


trip_schema = StructType([
    StructField('medallion', StringType()),
    StructField('hack_license', StringType()),
    StructField('vendor_id', StringType()),
    StructField('rate_code', StringType()),
    StructField('store_and_fwd_flag', StringType()),
    StructField('pickup_datetime', TimestampType()),
    StructField('dropoff_datetime', TimestampType()),
    StructField('passenger_count', IntegerType()),
    StructField('trip_time_in_secs', IntegerType()),
    StructField('trip_distance', DoubleType()),
    StructField('pickup_longitude', DoubleType()),
    StructField('pickup_latitude', DoubleType()),
    StructField('dropoff_longitude', DoubleType()),
    StructField('dropoff_latitude', DoubleType()),
    ])

# Read in the data into a PySpark DataFrame using the schema above
#  location: taxi_basedir + "/data/"
#  schema: trip_schema
#  format: csv
#  header: True
trip_data = # YOUR CODE HERE

Inspect the first 10 rows by converting them to a Pandas DataFrame.

In [None]:
# YOUR CODE HERE

### Inspect Schema

Just to be sure, let us inspect the schema. It should match exactly the specified one.

In [None]:
# YOUR CODE HERE

### Write into Structured Zone

Now we store the data into Hive tables as Parquet files. To do that, we first create an empty Hive database "taxi", named to reflect the source of the data.

In [None]:
spark.sql("CREATE DATABASE IF NOT EXISTS taxi")

In [None]:
# Write the DataFrame trip_data into Hive by using the method `saveAsTable`
#   format: parquet
#   Hive table: taxi.trip

# YOUR CODE HERE

## 1.2 Fare information

Now we read in the second table containing the trips' fare information. It contains the following columns:

* **medallion** - This is some sort of license for a taxi company
* **hack_license** - This is the driver's license
* **vendor_id**
* **pickup_datetime** This is the time when a passenger was picked up
* **payment_type** A code signifying how the passenger paid for the trip.
  * CRD = Credit card
  * CSH = Cash
  * ??? = No charge
  * ??? = Dispute
  * ??? = Unknown
  * ??? = Voided trip
* **fare_amount** The time-and-distance fare calculated by the meter
* **surcharge**
* **mta_tax** $0.50 MTA tax that is automatically triggered based on the metered rate in use
* **tip_amount** Tip amount. This field is automatically populated for credit card tips. Cash tips are not included
* **tolls_amount** Total amount of all tolls paid in trip
* **total_amount** The total amount charged to passengers. Does not include cash tips.

In [None]:
fare_schema = StructType([
    StructField('medallion', StringType()),
    StructField('hack_license', StringType()),
    StructField('vendor_id', StringType()),
    StructField('pickup_datetime', TimestampType()),
    StructField('payment_type', StringType()),
    StructField('fare_amount', DoubleType()),
    StructField('surcharge', DoubleType()),
    StructField('mta_tax', DoubleType()),
    StructField('tip_amount', DoubleType()),
    StructField('tolls_amount', DoubleType()),
    StructField('total_amount', DoubleType())
    ])

# Read in the Taxi fare information into a PySpark DataFrame trip_fare
#  location: taxi_basedir + "/fare/"
#  schema: fare_schema
#  format: csv
#  option 'header': True
#  option 'ignoreLeadingWhiteSpace': True
trip_fare = # YOUR CODE HERE

In [None]:
trip_fare.limit(10).toPandas()

### Inspect Schema

Let us inspect the schema of the data, which should exactly match the schema that we originally specified.

In [None]:
trip_fare.printSchema()

### Store into Structured Zone

Finally, store the data into the structured zone as Parquet files in a different Hive table, `taxi.fare`.

In [None]:
trip_fare.write.format("parquet").saveAsTable("taxi.fare")

# 2. Weather Data

In order to improve our analysis, we will relate the taxi trips with weather information. We use the NOAA ISD weather data (https://www.ncdc.noaa.gov/isd), which contains measurements from many stations around the world, some of them dating back to 1901. You can download all data from ftp://ftp.ncdc.noaa.gov/pub/data/noaa . We will only use a small subset of the data which is good enough for our purposes.

## 2.1 Station Master Data

The weather data is split up into two different data sets: the measurements themselves and metadata about the stations. The latter contains valuable information like the geo location of the weather station. This will be useful when trying to find the weather station nearest to each taxi trip.

Among other data, the columns provide the following information:
* **USAF** & **WBAN** - weather station id
* **CTRY** - the country of the weather station
* **STATE** - the state of the weather station
* **LAT** & **LONG** - latitude and longitude of the weather station (geo coordinates)
* **BEGIN** & **END** - date range when this weather station was active

In [None]:
# Read in weather station master data into a PySpark DataFrame weather_stations
#  location: weather_basedir + "/isd-history/"
#  format: csv
#  header: True
weather_stations = # YOUR CODE HERE

weather_stations.limit(10).toPandas()

### Store data into Structured Zone

In the next step we want to store the data as Parquet files (which are much more efficient and very well supported by most batch frameworks in the Hadoop and Spark universe). In order to do so, we first need to rename some columns, which contain unsupported characters:
* "STATION NAME" => "STATION_NAME"
* "ELEV(M)" => "ELEVATION"

After the columns have been renamed, the data frame is written into the structured zone into the Hive table `isd.stations` using the `DataFrame.write.saveAsTable` function. But we also need to take care of creating the Hive database `isd` first.

In [1]:
# Create Hive database "isd" using spark.sql(...)
# YOUR CODE HERE

In [None]:
# Write stations into Hive table "isd.stations"
weather_stations \
    .withColumnRenamed("STATION NAME", "STATION_NAME") \
    .withColumnRenamed("ELEV(M)", "ELEVATION") \
    .write.format("parquet").saveAsTable("isd.stations")

### Read in data again

Using the `spark.read.table` function, we read the data back into Spark and display some records.

In [None]:
weather_stations = # YOUR CODE HERE
weather_stations.limit(10).toPandas()

## 2.2 Weather Measurements

Now we will work with the second and more interesting part of the NOAA weather data set: The measurements. These are stored in different subdirectories per year. For us, the year 2013 is good enough, since the taxi trips are all from 2013.

The data format is a proprietary ASCII encoding, so we use the `spark.read.text` method to read each line as one record.

In [None]:
# Read raw measurements into PySpark DataFrame raw_weather
#  location: weather_basedir + "/2013"
#  format: text
raw_weather = # YOUR CODE HERE
raw_weather.limit(10).toPandas()

### Extract precipitation

Now we extract the precipitation from the measurements. This is not trivial, since that information is stored in a variable part. We assume that the record contains precipitation data when it contains the substring `AA1` at position 109. This denotes the type of the subsection in the data record followed by the number of hours of this measurement and the precipitation depth.

We use some PySpark string functions to extract the data.
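
To make the position arithmetic easier to follow, here is a minimal sketch of the same rule in plain Python. It assumes the layout described above (an `AA1` marker at 1-based position 109, followed by a 2-character hour count and a 4-character depth). The function name `extract_precipitation` and the synthetic record are hypothetical. Note that Spark string positions are 1-based, while Python indices are 0-based.

```python
def extract_precipitation(line):
    """Return (hours, depth) if the record has an AA1 section at
    1-based position 109, otherwise None."""
    # Spark's instr(...) == 109 corresponds to Python's find(...) == 108
    if line.find("AA1") != 108:
        return None
    # Spark's substring(value, 109 + 3, 8) corresponds to line[111:119]
    section = line[111:119]
    hours = int(section[0:2])    # substring(AAD, 1, 2)
    depth = float(section[2:6])  # substring(AAD, 3, 4)
    return hours, depth


# Synthetic record: 108 filler characters, then the AA1 section
record = "x" * 108 + "AA1" + "01" + "0023" + "9" + "5"

print(extract_precipitation(record))     # (1, 23.0)
print(extract_precipitation("x" * 120))  # None
```

Unlike the Spark `cast`, which yields `NULL` for values that cannot be parsed, this sketch raises a `ValueError` on malformed numbers.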

In [None]:
raw_weather.select(
        f.substring(raw_weather["value"],106,999),
        f.instr(raw_weather["value"],"AA1").alias("s"),
        f.when(f.instr(raw_weather["value"],"AA1") == 109,f.substring(raw_weather["value"], 109+3, 8)).alias("AAD")
    )\
    .withColumn("precipitation_hours", f.substring(f.col("AAD"), 1, 2).cast("INT")) \
    .withColumn("precipitation_depth", f.substring(f.col("AAD"), 3, 4).cast("FLOAT")) \
    .filter(f.col("precipitation_depth") > 0) \
    .limit(10).toPandas()

### Extract all relevant measurements

The precipitation was the hardest part. Other measurements like wind speed and air temperature are stored at fixed positions together with some quality flags denoting whether a measurement is valid. In the following statement, we extract all relevant measurements. Specifically, we extract the following information:
* **USAF** & **WBAN** - weather station identifier
* **ts** - timestamp of measurement
* **wind_direction** - wind direction (in degrees)
* **wind_direction_qual** - quality flag of the wind direction
* **wind_speed** - wind speed
* **wind_speed_qual** - quality flag indicating the validity of the wind speed
* **air_temperature** - air temperature in degrees Celsius
* **air_temperature_qual** - quality flag for air temperature
* **precipitation_hours**
* **precipitation_depth**
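
The fixed-position extraction can be illustrated without Spark. The following is a minimal sketch in plain Python that uses the same positions and lengths as the PySpark statement in the next cell. The helpers `substring`, `parse_record` and `make_synthetic_record` are hypothetical, and the record is synthetic filler rather than a real NOAA line.

```python
from datetime import datetime


def substring(text, pos, length):
    """Mimic Spark's 1-based substring(str, pos, len) with Python slicing."""
    return text[pos - 1:pos - 1 + length]


def parse_record(line):
    """Extract a few fixed-position fields from one measurement record."""
    return {
        "usaf": substring(line, 5, 6),
        "wban": substring(line, 11, 5),
        "ts": datetime.strptime(substring(line, 16, 12), "%Y%m%d%H%M"),
        "wind_direction": substring(line, 61, 3),
        "wind_direction_qual": substring(line, 64, 1),
        "wind_speed": float(substring(line, 66, 4)) / 10.0,
        "wind_speed_qual": substring(line, 70, 1),
        "air_temperature": float(substring(line, 88, 5)) / 10.0,
        "air_temperature_qual": substring(line, 93, 1),
    }


def make_synthetic_record():
    """Build a 105-character filler line with values at the expected positions."""
    chars = ["0"] * 105
    values = [(5, "725030"), (11, "14732"), (16, "201301011251"),
              (61, "270"), (64, "1"), (66, "0046"), (70, "1"),
              (88, "+0012"), (93, "1")]
    for pos, text in values:
        chars[pos - 1:pos - 1 + len(text)] = text
    return "".join(chars)


parsed = parse_record(make_synthetic_record())
print(parsed)
```

The division by 10 mirrors the scaling applied to wind speed and air temperature in the PySpark statement.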

In [None]:
weather = raw_weather.select(
        f.substring(raw_weather["value"],5,6).alias("usaf"),
        f.substring(raw_weather["value"],11,5).alias("wban"),
        f.to_timestamp(f.substring(raw_weather["value"],16,12), "yyyyMMddHHmm").alias("ts"),
        f.substring(raw_weather["value"],42,5).alias("report_type"),
        f.substring(raw_weather["value"],61,3).alias("wind_direction"),
        f.substring(raw_weather["value"],64,1).alias("wind_direction_qual"),
        f.substring(raw_weather["value"],65,1).alias("wind_observation"),
        (f.substring(raw_weather["value"],66,4).cast("float") / 10.0).alias("wind_speed"),
        f.substring(raw_weather["value"],70,1).alias("wind_speed_qual"),
        (f.substring(raw_weather["value"],88,5).cast("float") / 10.0).alias("air_temperature"),
        f.substring(raw_weather["value"],93,1).alias("air_temperature_qual"),
        f.when(f.instr(raw_weather["value"],"AA1") == 109,f.substring(raw_weather["value"], 109+3, 8)).alias("AAD")
    ) \
    .withColumn("precipitation_hours", f.substring(f.col("AAD"), 1, 2).cast("INT")) \
    .withColumn("precipitation_depth", f.substring(f.col("AAD"), 3, 4).cast("FLOAT")) \
    .withColumn("date", f.to_date(f.col("ts"))) \
    .drop("AAD")
    
weather.limit(10).toPandas()

### Store into Structured Zone

After successful extraction, we write the result again into the structured zone into Hive table `isd.weather`. Since we originally have weather measurements for different years, we create a partitioned Hive table - although we are only interested in the weather of 2013. Unfortunately the support for writing into partitioned Hive tables is currently rather limited within the PySpark API. But we can perform everything using Spark SQL instead of the PySpark API.

#### Create partitioned table

As mentioned above, we cannot easily create a partitioned table using the PySpark API. But we can create one using Spark SQL. We create the required SQL statement dynamically from the given schema (but with a fixed partition column)
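
To make the shape of the generated statement concrete, here is a minimal sketch in plain Python that builds the same kind of `CREATE TABLE` statement from a list of (name, type) pairs instead of a Spark schema. The function `build_create_table` and the two example columns are hypothetical.

```python
def build_create_table(table, columns, partition_column="year", partition_type="INT"):
    """Build a CREATE TABLE statement with a single partition column.

    `columns` is a list of (column name, SQL type) pairs.
    """
    column_list = ",".join(name + " " + dtype for name, dtype in columns)
    return (
        "CREATE TABLE IF NOT EXISTS " + table
        + "(" + column_list + ")"
        + " PARTITIONED BY(" + partition_column + " " + partition_type + ")"
    )


statement = build_create_table("isd.weather", [("usaf", "string"), ("ts", "timestamp")])
print(statement)
```

The partition column is declared only in the `PARTITIONED BY` clause and must not be repeated in the regular column list.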

In [None]:
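# Use `field` as the loop variable so that the alias `f` (pyspark.sql.functions) is not shadowed
columns = [field.name + " " + field.dataType.simpleString() for field in weather.schema.fields]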
sql = "CREATE TABLE IF NOT EXISTS isd.weather(" + ",".join(columns) + ") PARTITIONED BY(year INT)"
print(sql)

spark.sql(sql)

#### Store data into partition

In order to write into a single partition, we again need to use Spark SQL. Therefore we register the weather data of 2013 as a named temporary view for SQL access and then insert all records into one specific partition.

In [None]:
weather.createOrReplaceTempView("weather_2013")

spark.sql("""
    INSERT OVERWRITE TABLE isd.weather PARTITION(year=2013)
    SELECT * FROM weather_2013
""")

### Read in from Structured Zone

Again we read back the data from the Hive table `isd.weather` and display the first 10 records.

In [None]:
## YOUR CODE HERE

# 3. Holidays

Another important data source is additional date information, specifically if a certain date is a bank holiday. While other information like week days can be directly computed from a date, for bank holidays an additional source is required.

We follow again the same approach of reading in the raw data and storing it into the structured zone as Parquet files.

In [None]:
holidays_schema = StructType([
    StructField('id', IntegerType()),
    StructField('date', DateType()),
    StructField('description', StringType()),
    StructField('bank_holiday', BooleanType())
    ])

# Read in holidays file into PySpark DataFrame holidays
#  location: holidays_basedir
#  schema: holidays_schema
#  format: csv
#  header: False
holidays = # YOUR CODE HERE

# Peek inside the data
holidays.limit(10).toPandas()

Again, let us inspect the schema.

In [None]:
holidays.printSchema()

### Store into Structured Zone

Same game. Let us create a new database `ref` for simple reference tables and store the holidays in a table `ref.holidays`.

In [None]:
# Create Hive database ref
# YOUR CODE HERE

In [None]:
# Store holidays into Hive table "ref.holidays"
# YOUR CODE HERE

### Read in from Structured Zone

Again let us check if writing was successful.

In [None]:
holidays = spark.read.table("ref.holidays")
holidays.limit(10).toPandas()