# Using Azure Open Datasets in Synapse - Enrich NYC Green Taxi Data with Holiday and Weather

Synapse has the [Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/) package pre-installed. This notebook shows how to enrich NYC Green Taxi data with holiday and weather data, focusing on how to:
- read an Azure Open Dataset
- manipulate the data to prepare it for further analysis, including column projection, filtering, grouping, and joins
- create a Spark table to be used in other notebooks for model training

## Data loading
Let's first load the [NYC green taxi trip records](https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/). The Open Datasets package contains a class representing each data source (NycTlcGreen, for example) that makes it easy to filter by date before downloading. This notebook instead reads the Parquet files directly from the `azureopendatastorage` blob account and filters them in Spark; the SAS token is left empty.

In [3]:
// Load nyc green taxi trip records from azure open dataset
val blob_account_name = "azureopendatastorage"

val nyc_blob_container_name = "nyctlc"
val nyc_blob_relative_path = "green"
val nyc_blob_sas_token = ""

val nyc_wasbs_path = f"wasbs://$nyc_blob_container_name@$blob_account_name.blob.core.windows.net/$nyc_blob_relative_path"
spark.conf.set(f"fs.azure.sas.$nyc_blob_container_name.$blob_account_name.blob.core.windows.net",nyc_blob_sas_token)

val nyc_tlc = spark.read.parquet(nyc_wasbs_path)

//nyc_tlc.show(5, truncate = false)

blob_account_name: String = azureopendatastorage
nyc_blob_container_name: String = nyctlc
nyc_blob_relative_path: String = green
nyc_blob_sas_token: String = ""
nyc_wasbs_path: String = wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/green
nyc_tlc: org.apache.spark.sql.DataFrame = [vendorID: int, lpepPickupDatetime: timestamp ... 23 more fields]

In [4]:
// Filter data by time range
import java.sql.Timestamp
import org.joda.time.DateTime

val end_date = new Timestamp(DateTime.parse("2018-06-06").getMillis)
val start_date = new Timestamp(DateTime.parse("2018-05-01").getMillis)

val nyc_tlc_df = nyc_tlc.filter((nyc_tlc("lpepPickupDatetime") >= start_date) && (nyc_tlc("lpepPickupDatetime") <= end_date)) 
nyc_tlc_df.show(5, truncate = false)

import java.sql.Timestamp
import org.joda.time.DateTime
end_date: java.sql.Timestamp = 2018-06-06 00:00:00.0
start_date: java.sql.Timestamp = 2018-05-01 00:00:00.0
nyc_tlc_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [vendorID: int, lpepPickupDatetime: timestamp ... 23 more fields]
+--------+-------------------+-------------------+--------------+------------+------------+------------+---------------+--------------+----------------+---------------+----------+---------------+-----------+----------+-----+------+--------------------+---------+-----------+--------+-----------+--------+------+-------+
|vendorID|lpepPickupDatetime |lpepDropoffDatetime|passengerCount|tripDistance|puLocationId|doLocationId|pickupLongitude|pickupLatitude|dropoffLongitude|dropoffLatitude|rateCodeID|storeAndFwdFlag|paymentType|fareAmount|extra|mtaTax|improvementSurcharge|tipAmount|tollsAmount|ehailFee|totalAmount|tripType|puYear|puMonth|
+--------+-------------------+-------------------+-----------
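The time-range filter above is inclusive on both ends, but `end_date` is midnight at the start of 2018-06-06, so pickups later on that day fall outside the window. The following is a minimal sketch of that predicate in plain Scala. It builds the bounds with `Timestamp.valueOf` instead of Joda-Time, and `inWindow` is a hypothetical helper, not part of the notebook.

```scala
import java.sql.Timestamp

// Same bounds as the notebook cell, built with Timestamp.valueOf instead of Joda-Time.
val startDate = Timestamp.valueOf("2018-05-01 00:00:00")
val endDate   = Timestamp.valueOf("2018-06-06 00:00:00")

// Hypothetical helper mirroring the filter: startDate <= pickup <= endDate.
def inWindow(pickup: Timestamp): Boolean =
  !pickup.before(startDate) && !pickup.after(endDate)

val midnightJune6 = Timestamp.valueOf("2018-06-06 00:00:00")
val morningJune6  = Timestamp.valueOf("2018-06-06 08:00:00")

println(inWindow(midnightJune6)) // true: exactly on the upper bound
println(inWindow(morningJune6))  // false: later on the same day
```

If the whole of June 6 should be included, one option is to move the upper bound to the next midnight and use a strict `<` comparison.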

Now that the initial data is loaded, let's project some new columns onto it:
- create new columns for the month number, day of month, day of week, and hour of day. This information will be used in the training model to factor in time-based seasonality.
- add a static column for the country code, which is needed to join the holiday data.

In [5]:
// Extract month, day of month, day of week, and hour of day from the pickup datetime, and add a static column for the country code to join holiday data.
import org.apache.spark.sql.functions._

val nyc_tlc_df_expand = (
                        nyc_tlc_df.withColumn("datetime", to_date(col("lpepPickupDatetime")))
                                  .withColumn("month_num",month(col("lpepPickupDatetime")))
                                  .withColumn("day_of_month",dayofmonth(col("lpepPickupDatetime")))
                                  .withColumn("day_of_week",dayofweek(col("lpepPickupDatetime")))
                                  .withColumn("hour_of_day",hour(col("lpepPickupDatetime")))
                                  .withColumn("country_code",lit("US"))
                        )

import org.apache.spark.sql.functions._
nyc_tlc_df_expand: org.apache.spark.sql.DataFrame = [vendorID: int, lpepPickupDatetime: timestamp ... 29 more fields]
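When interpreting the `day_of_week` feature, note that Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which differs from the ISO convention used by `java.time` (Monday = 1). The sketch below reproduces the Spark numbering in plain Scala; `sparkDayOfWeek` is a hypothetical helper introduced only for illustration.

```scala
import java.time.LocalDate

// Hypothetical helper: convert the ISO day number (Mon=1 .. Sun=7)
// to Spark's dayofweek numbering (Sun=1 .. Sat=7).
def sparkDayOfWeek(date: LocalDate): Int =
  (date.getDayOfWeek.getValue % 7) + 1

println(sparkDayOfWeek(LocalDate.parse("2018-05-18"))) // 6 (a Friday)
println(sparkDayOfWeek(LocalDate.parse("2018-05-28"))) // 2 (a Monday)
```

These values agree with the `day_of_week` column shown for the same dates in the previews in this notebook.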

In [6]:
// Display 5 rows
// nyc_tlc_df_expand.show(5, truncate = false)

Remove the columns that are not needed for modeling or additional feature building.

In [7]:
// Remove unused columns from nyc green taxi data
val nyc_tlc_df_clean = nyc_tlc_df_expand.drop(
                    "lpepDropoffDatetime", "puLocationId", "doLocationId", "pickupLongitude", 
                     "pickupLatitude", "dropoffLongitude","dropoffLatitude" ,"rateCodeID", 
                     "storeAndFwdFlag","paymentType", "fareAmount", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType" )

nyc_tlc_df_clean: org.apache.spark.sql.DataFrame = [vendorID: int, lpepPickupDatetime: timestamp ... 12 more fields]

In [8]:
// Display 5 rows
nyc_tlc_df_clean.show(5, truncate = false)

+--------+-------------------+--------------+------------+---------+-----------+------+-------+----------+---------+------------+-----------+-----------+------------+
|vendorID|lpepPickupDatetime |passengerCount|tripDistance|tipAmount|totalAmount|puYear|puMonth|datetime  |month_num|day_of_month|day_of_week|hour_of_day|country_code|
+--------+-------------------+--------------+------------+---------+-----------+------+-------+----------+---------+------------+-----------+-----------+------------+
|2       |2018-05-18 20:32:54|1             |6.73        |0.0      |27.3       |2018  |5      |2018-05-18|5        |18          |6          |20         |US          |
|2       |2018-05-25 20:16:31|1             |0.51        |0.0      |5.3        |2018  |5      |2018-05-25|5        |25          |6          |20         |US          |
|2       |2018-05-25 20:26:02|1             |2.6         |0.0      |12.8       |2018  |5      |2018-05-25|5        |25          |6          |20         |US          

## Enrich with holiday data
Now that the taxi data is downloaded and roughly prepared, add holiday data as additional features. Holiday-specific features can improve model accuracy, because major holidays are times when taxi demand increases sharply and supply becomes limited.

Let's load the [public holidays](https://azure.microsoft.com/en-us/services/open-datasets/catalog/public-holidays/) from Azure Open Datasets.


In [9]:
// Load public holidays data from azure open dataset
val hol_blob_container_name = "holidaydatacontainer"
val hol_blob_relative_path = "Processed"
val hol_blob_sas_token = ""

val hol_wasbs_path = f"wasbs://$hol_blob_container_name@$blob_account_name.blob.core.windows.net/$hol_blob_relative_path"
spark.conf.set(f"fs.azure.sas.$hol_blob_container_name.$blob_account_name.blob.core.windows.net",hol_blob_sas_token)

val hol_raw = spark.read.parquet(hol_wasbs_path)

// Filter data by time range
val hol_df = hol_raw.filter((hol_raw("date") >= start_date) && (hol_raw("date") <= end_date))

// Display 5 rows
// hol_df.show(5, truncate = false)

hol_blob_container_name: String = holidaydatacontainer
hol_blob_relative_path: String = Processed
hol_blob_sas_token: String = ""
hol_wasbs_path: String = wasbs://holidaydatacontainer@azureopendatastorage.blob.core.windows.net/Processed
hol_raw: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 4 more fields]
hol_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [countryOrRegion: string, holidayName: string ... 4 more fields]

Rename the countryRegionCode column to country_code to match the taxi data, and derive a datetime column from date by truncating it to a calendar date so it can be used as a join key.

In [10]:
val hol_df_clean = (
                hol_df.withColumnRenamed("countryRegionCode","country_code")
                .withColumn("datetime",to_date(col("date")))
                )

hol_df_clean.show(5, truncate = false)

hol_df_clean: org.apache.spark.sql.DataFrame = [countryOrRegion: string, holidayName: string ... 5 more fields]
+---------------+----------------------------+----------------------------+-------------+------------+-------------------+----------+
|countryOrRegion|holidayName                 |normalizeHolidayName        |isPaidTimeOff|country_code|date               |datetime  |
+---------------+----------------------------+----------------------------+-------------+------------+-------------------+----------+
|Argentina      |Día del Trabajo [Labour Day]|Día del Trabajo [Labour Day]|null         |AR          |2018-05-01 00:00:00|2018-05-01|
|Austria        |Staatsfeiertag              |Staatsfeiertag              |null         |AT          |2018-05-01 00:00:00|2018-05-01|
|Belarus        |Праздник труда              |Праздник труда              |null         |BY          |2018-05-01 00:00:00|2018-05-01|
|Belgium        |Dag van de Arbeid           |Dag van de Arbeid           |null     

Next, join the holiday data with the taxi data by performing a left join. This preserves all records from the taxi data and adds holiday data where it exists for the corresponding datetime and country_code, which in this case is always "US". Preview the data to verify that the two datasets were merged correctly.

In [11]:
// enrich taxi data with holiday data
val nyc_taxi_holiday_df = nyc_tlc_df_clean.join(hol_df_clean, Seq("datetime", "country_code") , "left")

nyc_taxi_holiday_df.show(5,truncate = false)

nyc_taxi_holiday_df: org.apache.spark.sql.DataFrame = [datetime: date, country_code: string ... 17 more fields]
+----------+------------+--------+-------------------+--------------+------------+---------+-----------+------+-------+---------+------------+-----------+-----------+---------------+-----------+--------------------+-------------+----+
|datetime  |country_code|vendorID|lpepPickupDatetime |passengerCount|tripDistance|tipAmount|totalAmount|puYear|puMonth|month_num|day_of_month|day_of_week|hour_of_day|countryOrRegion|holidayName|normalizeHolidayName|isPaidTimeOff|date|
+----------+------------+--------+-------------------+--------------+------------+---------+-----------+------+-------+---------+------------+-----------+-----------+---------------+-----------+--------------------+-------------+----+
|2018-05-18|US          |2       |2018-05-18 20:32:54|1             |6.73        |0.0      |27.3       |2018  |5      |5        |18          |6          |20         |null           |n
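Besides previewing rows, one way to check a left join is to confirm that it did not change the number of taxi rows, which holds as long as the join key is unique in the holiday table. The sketch below shows that check on toy data. The frames, the values, and the local `SparkSession` named `session` are assumptions for illustration; in a Synapse notebook, `getOrCreate()` should simply return the session already bound to `spark`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for a standalone run; Synapse reuses its existing one.
val session = SparkSession.builder().master("local[*]").getOrCreate()
import session.implicits._

// Toy stand-ins for the taxi and holiday frames (illustrative values only).
val toyTrips = Seq(
  ("2018-05-18", "US", 27.3),
  ("2018-05-28", "US", 13.56),
  ("2018-05-28", "US", 8.8)
).toDF("datetime", "country_code", "totalAmount")

val toyHolidays = Seq(("2018-05-28", "US", "Memorial Day"))
  .toDF("datetime", "country_code", "holidayName")

val toyJoined = toyTrips.join(toyHolidays, Seq("datetime", "country_code"), "left")

// With a unique key on the right side, a left join keeps the row count unchanged.
assert(toyJoined.count() == toyTrips.count())

// Trips on non-holiday dates carry a null holidayName.
toyJoined.groupBy(col("holidayName").isNotNull.as("is_holiday")).count().show()
```

If the row count grows after the join, the right-hand table has more than one row for some key, and it would need deduplicating first.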

In [12]:
// Create a temp view and keep only the rows that fall on a holiday (non-null holidayName)

nyc_taxi_holiday_df.createOrReplaceTempView("nyc_taxi_holiday_df")
val result = spark.sql("SELECT * from nyc_taxi_holiday_df WHERE holidayName is NOT NULL ")
result.show(5, truncate = false)

result: org.apache.spark.sql.DataFrame = [datetime: date, country_code: string ... 17 more fields]
+----------+------------+--------+-------------------+--------------+------------+---------+-----------+------+-------+---------+------------+-----------+-----------+---------------+------------+--------------------+-------------+-------------------+
|datetime  |country_code|vendorID|lpepPickupDatetime |passengerCount|tripDistance|tipAmount|totalAmount|puYear|puMonth|month_num|day_of_month|day_of_week|hour_of_day|countryOrRegion|holidayName |normalizeHolidayName|isPaidTimeOff|date               |
+----------+------------+--------+-------------------+--------------+------------+---------+-----------+------+-------+---------+------------+-----------+-----------+---------------+------------+--------------------+-------------+-------------------+
|2018-05-28|US          |2       |2018-05-28 10:28:09|1             |2.01        |2.26     |13.56      |2018  |5      |5        |28          |2     

## Enrich with weather data

Now we append NOAA surface weather data to the taxi and holiday data. Use a similar approach to fetch the [NOAA weather history data](https://azure.microsoft.com/en-us/services/open-datasets/catalog/noaa-integrated-surface-data/) from Azure Open Datasets. 

In [13]:
// Load weather data from azure open dataset
val weather_blob_container_name = "isdweatherdatacontainer"
val weather_blob_relative_path = "ISDWeather/"
val weather_blob_sas_token = ""

val weather_wasbs_path = f"wasbs://$weather_blob_container_name@$blob_account_name.blob.core.windows.net/$weather_blob_relative_path"
spark.conf.set(f"fs.azure.sas.$weather_blob_container_name.$blob_account_name.blob.core.windows.net",weather_blob_sas_token)

val isd = spark.read.parquet(weather_wasbs_path)

// Display 5 rows
// isd.show(5, truncate = false)

weather_blob_container_name: String = isdweatherdatacontainer
weather_blob_relative_path: String = ISDWeather/
weather_blob_sas_token: String = ""
weather_wasbs_path: String = wasbs://isdweatherdatacontainer@azureopendatastorage.blob.core.windows.net/ISDWeather/
isd: org.apache.spark.sql.DataFrame = [usaf: string, wban: string ... 21 more fields]

In [14]:
// Filter data by time range
val isd_df = isd.filter((isd("datetime") >= start_date) && (isd("datetime") <= end_date))

// Display 5 rows
isd_df.show(5, truncate = false)

isd_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [usaf: string, wban: string ... 21 more fields]
+------+-----+-------------------+--------+---------+---------+---------+---------+-----------+--------------+-------------+-----------------------+--------------------+----------+-----------+---------+-------------+---------------+------------+----+---+-------+-----+
|usaf  |wban |datetime           |latitude|longitude|elevation|windAngle|windSpeed|temperature|seaLvlPressure|cloudCoverage|presentWeatherIndicator|pastWeatherIndicator|precipTime|precipDepth|snowDepth|stationName  |countryOrRegion|p_k         |year|day|version|month|
+------+-----+-------------------+--------+---------+---------+---------+---------+-----------+--------------+-------------+-----------------------+--------------------+----------+-----------+---------+-------------+---------------+------------+----+---+-------+-----+
|999999|53182|2018-05-26 07:55:00|36.568  |-101.61  |1000.0   |null     |null   

In [15]:
// Keep only weather stations inside a bounding box around New York City, and drop readings with a null temperature

val weather_df = (
                isd_df.filter(isd_df("latitude") >= 40.53)
                        .filter(isd_df("latitude") <= 40.88)
                        .filter(isd_df("longitude") >= -74.09)
                        .filter(isd_df("longitude") <= -73.72)
                        .filter(isd_df("temperature").isNotNull)
                        .withColumnRenamed("datetime","datetime_full")
                        )

weather_df: org.apache.spark.sql.DataFrame = [usaf: string, wban: string ... 21 more fields]
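The latitude and longitude filters above amount to a rectangular bounding box around New York City. The following sketch expresses the same test in plain Scala. `inNycBox` is a hypothetical helper, and the first test coordinate is only a rough location for Central Park.

```scala
// Bounding box taken from the notebook's weather filter.
val (minLat, maxLat) = (40.53, 40.88)
val (minLon, maxLon) = (-74.09, -73.72)

// Hypothetical helper: true when a station lies inside the box (bounds inclusive).
def inNycBox(lat: Double, lon: Double): Boolean =
  lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon

println(inNycBox(40.779, -73.969)) // true: roughly Central Park
println(inNycBox(36.568, -101.61)) // false: the station in the isd_df preview
```

A rectangular box is a coarse approximation of the city boundary, so it may include a few stations just outside the city limits.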

In [16]:
// Remove unused columns
val weather_df_clean = weather_df.drop("usaf", "wban", "longitude", "latitude").withColumn("datetime", to_date(col("datetime_full")))

//weather_df_clean.show(5, truncate = false)

weather_df_clean: org.apache.spark.sql.DataFrame = [datetime_full: timestamp, elevation: double ... 18 more fields]

Next, group the weather data by date so that you have daily aggregated weather values.

In [17]:
// Aggregate the weather readings into daily statistics

val weather_df_grouped = (
                        weather_df_clean.groupBy('datetime).
                        agg(
                            mean('snowDepth) as "avg_snowDepth",
                            max('precipTime) as "max_precipTime",
                            mean('temperature) as "avg_temperature",
                            max('precipDepth) as "max_precipDepth"
                            )
                        )

weather_df_grouped.show(5, truncate = false)

weather_df_grouped: org.apache.spark.sql.DataFrame = [datetime: date, avg_snowDepth: double ... 3 more fields]
+----------+-------------+--------------+------------------+---------------+
|datetime  |avg_snowDepth|max_precipTime|avg_temperature   |max_precipDepth|
+----------+-------------+--------------+------------------+---------------+
|2018-05-28|null         |24.0          |15.333636363636389|2540.0         |
|2018-06-06|null         |6.0           |21.4              |0.0            |
|2018-05-26|null         |24.0          |26.07233009708738 |2540.0         |
|2018-05-27|null         |24.0          |18.931365313653128|7648.0         |
|2018-06-03|null         |24.0          |18.24280303030302 |2540.0         |
+----------+-------------+--------------+------------------+---------------+
only showing top 5 rows
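The `null` values in `avg_snowDepth` follow from how Spark aggregates handle missing data: `mean` and `max` skip nulls and return null when a group has no non-null values at all. The sketch below mirrors that behavior with plain Scala collections, using `Option` in place of nulls. The `Reading` type, the `meanOpt` helper, and the sample values are assumptions made for illustration.

```scala
// Hypothetical reading type; None stands for a null measurement.
case class Reading(date: String, temperature: Option[Double], snowDepth: Option[Double])

val readings = Seq(
  Reading("2018-05-28", Some(15.0), None),
  Reading("2018-05-28", Some(16.0), None),
  Reading("2018-05-28", None,       None),
  Reading("2018-06-06", Some(21.4), None)
)

// Mean over the defined values only; None when every value is missing.
def meanOpt(values: Seq[Option[Double]]): Option[Double] = {
  val present = values.flatten
  if (present.isEmpty) None else Some(present.sum / present.size)
}

// Group by date and aggregate each group, like groupBy(...).agg(...) in Spark.
val daily = readings.groupBy(_.date).map { case (date, rs) =>
  date -> (meanOpt(rs.map(_.temperature)), meanOpt(rs.map(_.snowDepth)))
}

println(daily("2018-05-28")) // (Some(15.5),None)
```

The missing temperature on 2018-05-28 does not pull the average down, and the all-missing snow depth yields no value rather than zero.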

Merge the taxi and holiday data you prepared with the new weather data. This time you only need the datetime key, and again perform a left-join of the data. Run the describe() function on the new dataframe to see summary statistics for each field.

In [18]:
// Enrich taxi data with weather
val nyc_taxi_holiday_weather_df = nyc_taxi_holiday_df.join(weather_df_grouped, Seq("datetime") ,"left")
nyc_taxi_holiday_weather_df.cache()

nyc_taxi_holiday_weather_df: org.apache.spark.sql.DataFrame = [datetime: date, country_code: string ... 21 more fields]
res64: nyc_taxi_holiday_weather_df.type = [datetime: date, country_code: string ... 21 more fields]

In [19]:
nyc_taxi_holiday_weather_df.show(5,truncate = false)

+----------+------------+--------+-------------------+--------------+------------+---------+-----------+------+-------+---------+------------+-----------+-----------+---------------+------------+--------------------+-------------+-------------------+-------------+--------------+------------------+---------------+
|datetime  |country_code|vendorID|lpepPickupDatetime |passengerCount|tripDistance|tipAmount|totalAmount|puYear|puMonth|month_num|day_of_month|day_of_week|hour_of_day|countryOrRegion|holidayName |normalizeHolidayName|isPaidTimeOff|date               |avg_snowDepth|max_precipTime|avg_temperature   |max_precipDepth|
+----------+------------+--------+-------------------+--------------+------------+---------+-----------+------+-------+---------+------------+-----------+-----------+---------------+------------+--------------------+-------------+-------------------+-------------+--------------+------------------+---------------+
|2018-05-28|US          |2       |2018-05-28 10:28:09|1

In [20]:
// Run the describe() function on the new dataframe to see summary statistics for each field.
display(nyc_taxi_holiday_weather_df.describe())

The summary statistics show that the totalAmount field has negative values, which do not make sense in this context.

In [21]:
// Keep only rows with a positive tip and a positive total amount (note: this also drops trips with a zero tip)
val final_df = (
            nyc_taxi_holiday_weather_df.
            filter(nyc_taxi_holiday_weather_df("tipAmount") > 0).
            filter(nyc_taxi_holiday_weather_df("totalAmount") > 0)
            )

final_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [datetime: date, country_code: string ... 21 more fields]
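The filter above uses a strict `> 0` comparison, so it removes zero-tip trips along with negative amounts. The sketch below contrasts the strict and non-strict variants on a few made-up trips in plain Scala. The `Trip` type and the values are assumptions for illustration.

```scala
// Hypothetical trip record with only the two filtered fields.
case class Trip(tipAmount: Double, totalAmount: Double)

val sampleTrips = Seq(
  Trip(2.26, 13.56), // tipped ride
  Trip(0.0, 27.3),   // valid fare, no recorded tip
  Trip(0.0, -5.0)    // negative total, treated as invalid
)

// Same shape as the notebook's filter: both amounts strictly positive.
val strictlyPositive = sampleTrips.filter(t => t.tipAmount > 0 && t.totalAmount > 0)

// Alternative: allow zero tips, still require a positive total.
val zeroTipAllowed = sampleTrips.filter(t => t.tipAmount >= 0 && t.totalAmount > 0)

println(strictlyPositive.size) // 1
println(zeroTipAllowed.size)   // 2
```

Which variant is appropriate depends on the modeling goal: the strict filter keeps only tipped rides, while the non-strict one keeps every ride with a valid fare.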

## Cleaning up the existing database

First, drop the table: by default, Spark refuses to drop a database that still contains tables.

Then drop and recreate the database, and make it the default database for the session.

In [22]:
spark.sql("DROP TABLE IF EXISTS NYCTaxi.nyc_taxi_holiday_weather"); 

res69: org.apache.spark.sql.DataFrame = []

In [23]:
spark.sql("DROP DATABASE IF EXISTS NYCTaxi"); 
spark.sql("CREATE DATABASE NYCTaxi"); 
spark.sql("USE NYCTaxi");

res70: org.apache.spark.sql.DataFrame = []
res71: org.apache.spark.sql.DataFrame = []
res72: org.apache.spark.sql.DataFrame = []
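As an alternative to dropping each table first, Spark SQL accepts a `CASCADE` clause on `DROP DATABASE` that removes the contained tables together with the database. The sketch below tries this on a throwaway database. The name `scratch_drop_demo`, its table, and the local `SparkSession` are assumptions for illustration, chosen so the real NYCTaxi database is not touched.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; Synapse reuses its existing one.
val session = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical scratch database holding one table.
session.sql("CREATE DATABASE IF NOT EXISTS scratch_drop_demo")
session.sql("CREATE TABLE IF NOT EXISTS scratch_drop_demo.t (id INT) USING parquet")

// CASCADE drops the contained tables along with the database.
session.sql("DROP DATABASE IF EXISTS scratch_drop_demo CASCADE")

println(session.catalog.databaseExists("scratch_drop_demo")) // false
```

Because `CASCADE` deletes every table in the database, the explicit per-table drop used in this notebook is the more conservative choice when a database is shared.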

## Creating a new table
We create a nyc_taxi_holiday_weather table from the final_df dataframe, which holds the cleaned taxi, holiday, and weather data.

In [24]:
final_df.write.saveAsTable("nyc_taxi_holiday_weather");
val final_results = spark.sql("SELECT COUNT(*) FROM nyc_taxi_holiday_weather");
final_results.show(5, truncate = false)

final_results: org.apache.spark.sql.DataFrame = [count(1): bigint]
+--------+
|count(1)|
+--------+
|337444  |
+--------+
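Once saved, the table can be read back by name from another notebook attached to the same Spark pool. The sketch below shows the write-then-read round trip on a toy frame. The database `scratch_table_demo`, the table `toy_trips`, and the local `SparkSession` are assumptions for illustration; `mode("overwrite")` is shown as one way to make the write repeatable without dropping the table first.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; Synapse reuses its existing one.
val session = SparkSession.builder().master("local[*]").getOrCreate()
import session.implicits._

// Toy frame standing in for final_df (illustrative values only).
val toyFinal = Seq(("2018-05-28", 13.56)).toDF("datetime", "totalAmount")

session.sql("CREATE DATABASE IF NOT EXISTS scratch_table_demo")

// Overwrite mode replaces the table if it already exists.
toyFinal.write.mode("overwrite").saveAsTable("scratch_table_demo.toy_trips")

// Read the table back by its qualified name, as another notebook would.
val reloaded = session.table("scratch_table_demo.toy_trips")
val reloadedCount = reloaded.count()
println(reloadedCount) // 1

// Clean up the scratch database.
session.sql("DROP DATABASE IF EXISTS scratch_table_demo CASCADE")
```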