Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Tutorial: Load taxi data and enrich it with weather data

Install the `azureml-contrib-opendatasets` package, uninstalling any existing copy first.

In [1]:
!pip uninstall -y azureml-contrib-opendatasets
!pip install azureml-contrib-opendatasets

Uninstalling azureml-contrib-opendatasets-1.0.30:
  Successfully uninstalled azureml-contrib-opendatasets-1.0.30
Collecting azureml-contrib-opendatasets
  Using cached https://files.pythonhosted.org/packages/64/51/4d3de57cf210941346d907584e0e6e56780067bc3555250b1fe62c2285f7/azureml_contrib_opendatasets-1.0.30-py3-none-any.whl
Installing collected packages: azureml-contrib-opendatasets
Successfully installed azureml-contrib-opendatasets-1.0.30


Begin by creating a dataframe to hold the taxi data. In a non-Spark environment, some Open Datasets classes only allow downloading one month of data at a time, to avoid a MemoryError with large datasets. To cover a longer period such as a full year, fetch one month at a time and randomly sample 200 records from each month before appending them to green_taxi_df, so the dataframe stays small. The cell below fetches a single month, May 2016 (the January 2016 base dates shifted by 4 months).

In [2]:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from azureml.contrib.opendatasets import NycTlcGreen


green_taxi_df = pd.DataFrame([])
start = datetime.strptime("1/1/2016", "%m/%d/%Y")
end = datetime.strptime("1/31/2016", "%m/%d/%Y")

for sample_month in [4]:
    temp_df_green = NycTlcGreen(
        start + relativedelta(months=sample_month),
        end + relativedelta(months=sample_month)).to_pandas_dataframe()
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    green_taxi_df = pd.concat([green_taxi_df, temp_df_green.sample(200)])



ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=5/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=5/part-00044-tid-6037743401120983271-619c4849-c957-4290-a1b8-66832cb385b6-12463.c000.snappy.parquet under container nyctlc
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=21357.18 [ms]
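
The month-at-a-time pattern described above can be extended to a full year. The following is a minimal sketch, not the tutorial's own code: `month_bounds` and `load_month` are hypothetical helpers, and `load_month` builds synthetic rows in place of `NycTlcGreen(start, end).to_pandas_dataframe()` so that the example runs without network access.

```python
import calendar
from datetime import datetime

import pandas as pd


def month_bounds(year, month):
    """Return the first and last calendar day of a month as datetimes."""
    last_day = calendar.monthrange(year, month)[1]
    return datetime(year, month, 1), datetime(year, month, last_day)


def load_month(start, end):
    """Hypothetical stand-in for NycTlcGreen(start, end).to_pandas_dataframe().

    Builds 1,000 synthetic rows spread evenly between start and end.
    """
    pickups = pd.date_range(start, end, periods=1000)
    return pd.DataFrame({"lpepPickupDatetime": pickups,
                         "tripDistance": range(1000)})


monthly_samples = []
for month in range(1, 13):
    start, end = month_bounds(2016, month)
    month_df = load_month(start, end)
    # Keep 200 random rows per month; random_state makes the sample repeatable.
    monthly_samples.append(month_df.sample(200, random_state=month))

# Concatenate once at the end instead of growing the dataframe inside the loop.
year_df = pd.concat(monthly_samples, ignore_index=True)
print(year_df.shape)
```

Collecting the samples in a list and concatenating once avoids copying the accumulated dataframe on every iteration. Passing `ignore_index=True` also gives the result a fresh 0-based index.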


Save the list of original column names as `raw_columns`, so that the helper columns added during enrichment can be dropped in the final step.

In [3]:
raw_columns = list(green_taxi_df.columns)

The original index can cause the initialization of the `LocationTimeCustomerData` class below to fail. As a workaround, add a monotonically increasing `idx` column and use it as the index.

In [4]:
green_taxi_df['idx'] = list(range(len(green_taxi_df.index)))
green_taxi_df_idx = green_taxi_df.set_index('idx')
green_taxi_df_idx.head(5)

idx,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,paymentType,fareAmount,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType
0,1,2016-05-21 21:19:03,2016-05-21 22:14:52,1,11.0,,,-73.991203,40.671406,-73.983505,...,1,43.0,0.5,0.5,0.3,11.05,0.0,,55.35,1.0
1,1,2016-05-14 11:28:49,2016-05-14 11:35:31,1,1.2,,,-73.957764,40.717712,-73.940422,...,2,6.5,0.0,0.5,0.3,0.0,0.0,,7.3,1.0
2,2,2016-05-02 11:15:02,2016-05-02 11:39:01,2,9.92,,,-73.820343,40.758987,-73.780548,...,2,29.5,0.0,0.5,0.3,0.0,0.0,,30.3,1.0
3,2,2016-05-16 09:52:19,2016-05-16 10:01:23,1,0.85,,,-73.93264,40.795738,-73.936569,...,1,7.5,0.0,0.5,0.3,0.7,0.0,,9.0,1.0
4,2,2016-05-30 00:14:52,2016-05-30 00:26:28,1,2.71,,,-73.936836,40.701469,-73.958138,...,1,11.5,0.5,0.5,0.3,2.56,0.0,,15.36,1.0
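
The source does not say why the original index causes trouble. One plausible reason, stated here as an assumption, is that sampled and concatenated frames carry non-contiguous or duplicate index labels. The sketch below uses two small synthetic frames to show the effect and how the `idx` workaround restores a unique, increasing index.

```python
import pandas as pd

# Two synthetic "months", each with its own 0-based index.
jan = pd.DataFrame({"tripDistance": [1.2, 0.8, 3.4]})
feb = pd.DataFrame({"tripDistance": [2.1, 5.0, 0.4]})

# Sampling keeps the original labels, so the combined index repeats them.
sampled = pd.concat([jan.sample(2, random_state=0),
                     feb.sample(2, random_state=0)])
print(sampled.index.is_unique)

# Same workaround as in the tutorial: add an increasing id and index by it.
sampled["idx"] = list(range(len(sampled.index)))
indexed = sampled.set_index("idx")
print(list(indexed.index))
```

`sampled.reset_index(drop=True)` would produce the same 0-based labels. The explicit `idx` column mirrors the tutorial's approach and gives the index a name.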


Initialize `LocationTimeCustomerData` with the pandas dataframe `green_taxi_df_idx`, the pickup latitude and longitude columns, and the pickup time column.

In [5]:
# This is a contrib package in preview. The package name is subject to change.

from azureml.contrib.opendatasets.accessories.location_data import LatLongColumn
from azureml.contrib.opendatasets.accessories.location_time_customer_data \
    import LocationTimeCustomerData
from azureml.contrib.opendatasets import NoaaIsdWeather


green_taxi = LocationTimeCustomerData(
    green_taxi_df_idx,
    LatLongColumn('pickupLatitude', 'pickupLongitude'),
    'lpepPickupDatetime')

Initialize the `NoaaIsdWeather` class, get an enricher from it, and enrich the taxi data without aggregation.

In [6]:
weather = NoaaIsdWeather(
    cols=["temperature", "precipTime", "precipDepth", "snowDepth"],
    start_date=datetime(2016, 5, 1, 0, 0),
    end_date=datetime(2016, 5, 31, 23, 59))
weather_enricher = weather.get_enricher()
new_green_taxi, processed_weather = weather_enricher.enrich_customer_data_no_agg(
    customer_data_object=green_taxi,
    location_match_granularity=1,
    time_round_granularity='hour')

ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=4.9 [ms]
ActivityStarted, enrich_customer_data_no_agg
ActivityStarted, enrich
Target paths: ['/year=2016/month=5/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2016/month=5/part-00006-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-111.c000.snappy.parquet under container isdweatherdatacontainer
Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=131737.62 [ms]
ActivityCompleted: Activity=enrich_customer_data_no_agg, HowEnded=Success, Duration=131760.56 [ms]


Preview the pandas dataframe `new_green_taxi.data`.

In [7]:
new_green_taxi.data.head(3)

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,customer_rankgroupoedin,customer_join_timepnbn8
0,1,2016-05-21 21:19:03,2016-05-21 22:14:52,1,11.0,,,-73.991203,40.671406,-73.983505,...,0.5,0.5,0.3,11.05,0.0,,55.35,1.0,12,2016-05-21 21:00:00
1,1,2016-05-14 11:28:49,2016-05-14 11:35:31,1,1.2,,,-73.957764,40.717712,-73.940422,...,0.0,0.5,0.3,0.0,0.0,,7.3,1.0,72,2016-05-14 11:00:00
2,2,2016-05-02 11:15:02,2016-05-02 11:39:01,2,9.92,,,-73.820343,40.758987,-73.780548,...,0.0,0.5,0.3,0.0,0.0,,30.3,1.0,112,2016-05-02 11:00:00
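
In the preview, the `customer_join_time…` column holds each pickup time aligned to the hour, which is what `time_round_granularity='hour'` suggests. The enricher's exact rule is not shown in this tutorial. The sketch below assumes truncation to the hour, which reproduces the three rows above, although rounding to the nearest hour would match them too. The `join_time` column name is illustrative.

```python
import pandas as pd

pickups = pd.DataFrame({
    "lpepPickupDatetime": pd.to_datetime([
        "2016-05-21 21:19:03",
        "2016-05-14 11:28:49",
        "2016-05-02 11:15:02",
    ])
})

# Assumption: truncate each timestamp to the start of its hour.
pickups["join_time"] = pickups["lpepPickupDatetime"].dt.floor("60min")
print(pickups)
```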


Define a dict `aggregations` that specifies how to aggregate each weather field at the hourly level. For `snowDepth` and `temperature`, take the mean; for `precipTime` and `precipDepth`, take the hourly maximum. The dict is then passed to `groupby().agg()` to group the data.

In [8]:
aggregations = {
    "snowDepth": "mean",
    "precipTime": "max",
    "temperature": "mean",
    "precipDepth": "max"}

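The key column names (`public_rankgroup`, `public_join_time`, `customer_rankgroup`, `customer_join_time`) used by `groupby()` and the later `merge()` carry generated suffixes. With the current design they must be looked up at run time, either from the enricher objects or by matching column-name prefixes.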

In [9]:
public_rankgroup = processed_weather.id

public_join_time = [
    s for s in list(processed_weather.data.columns)
    if s.startswith('ds_join_time')][0]

customer_rankgroup = weather_enricher.location_selector.customer_rankgroup

customer_join_time = [
    s for s in list(new_green_taxi.data.columns)
    if s.startswith('customer_join_time')][0]

weather_df_grouped = processed_weather.data.groupby(by=[public_rankgroup, public_join_time]).agg(aggregations)
weather_df_grouped.head(3)

public_rankgroupvt7nj,ds_join_time12t17,snowDepth,precipTime,temperature,precipDepth
1,2016-05-18 19:00:00,,1.0,17.2,0.0
2,2016-05-04 20:00:00,,1.0,9.5,8.0
3,2016-05-13 18:00:00,,6.0,15.733333,0.0
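
The lookup-then-group step can be tried on a small synthetic frame. In this sketch, `find_column` is a hypothetical helper (not part of the package) that wraps the prefix match used above and fails loudly when the match is ambiguous. The column suffixes are copied from the preview, and both keys are found by prefix for simplicity, whereas the tutorial reads the rank-group key from `processed_weather.id`.

```python
import pandas as pd


def find_column(df, prefix):
    """Return the single column whose name starts with prefix (hypothetical helper)."""
    matches = [name for name in df.columns if name.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(
            f"expected one column starting with {prefix!r}, found {matches}")
    return matches[0]


# Synthetic stand-in for processed_weather.data.
readings = pd.DataFrame({
    "public_rankgroupvt7nj": [1, 1, 1, 2],
    "ds_join_time12t17": pd.to_datetime(
        ["2016-05-18 19:00"] * 3 + ["2016-05-04 20:00"]),
    "temperature": [17.0, 17.2, 17.4, 9.5],
    "precipDepth": [0.0, 3.0, 1.0, 8.0],
})

group_key = find_column(readings, "public_rankgroup")
time_key = find_column(readings, "ds_join_time")

# One output row per (rank group, hour); each field uses its own aggregation.
hourly = readings.groupby(by=[group_key, time_key]).agg(
    {"temperature": "mean", "precipDepth": "max"})
print(hourly)
```

The grouped result has a two-level index made of the two key columns, which is why the preview above shows them as index levels rather than ordinary columns.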


Left-join the taxi data with the grouped weather data, keep the original columns plus the four weather columns, and preview the result.

In [10]:
joined_dataset = new_green_taxi.data.merge(
    weather_df_grouped,
    left_on=[customer_rankgroup, customer_join_time],
    right_on=[public_rankgroup, public_join_time],
    how='left')

final_df = joined_dataset[raw_columns + [
    "temperature", "precipTime", "precipDepth", "snowDepth"]]
final_df.head(5)

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,temperature,precipTime,precipDepth,snowDepth
0,1,2016-05-21 21:19:03,2016-05-21 22:14:52,1,11.0,,,-73.991203,40.671406,-73.983505,...,0.3,11.05,0.0,,55.35,1.0,17.2,1.0,0.0,
1,1,2016-05-14 11:28:49,2016-05-14 11:35:31,1,1.2,,,-73.957764,40.717712,-73.940422,...,0.3,0.0,0.0,,7.3,1.0,16.0,,,
2,2,2016-05-02 11:15:02,2016-05-02 11:39:01,2,9.92,,,-73.820343,40.758987,-73.780548,...,0.3,0.0,0.0,,30.3,1.0,8.15,1.0,0.0,
3,2,2016-05-16 09:52:19,2016-05-16 10:01:23,1,0.85,,,-73.93264,40.795738,-73.936569,...,0.3,0.7,0.0,,9.0,1.0,6.0,,,
4,2,2016-05-30 00:14:52,2016-05-30 00:26:28,1,2.71,,,-73.936836,40.701469,-73.958138,...,0.3,2.56,0.0,,15.36,1.0,23.3,1.0,0.0,
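
Because the merge uses `how='left'`, every taxi row is kept and unmatched rows receive `NaN` in the weather columns. One way to quantify this, sketched here on synthetic data with illustrative column names, is pandas' `indicator=True` option, which adds a `_merge` column marking each row as `both` or `left_only`.

```python
import pandas as pd

trips = pd.DataFrame({
    "rankgroup": [1, 2, 3],
    "join_time": pd.to_datetime(
        ["2016-05-18 19:00", "2016-05-04 20:00", "2016-05-13 18:00"]),
    "fareAmount": [43.0, 6.5, 29.5],
})
weather = pd.DataFrame({
    "rankgroup": [1, 2],
    "join_time": pd.to_datetime(["2016-05-18 19:00", "2016-05-04 20:00"]),
    "temperature": [17.2, 9.5],
})

joined = trips.merge(
    weather, on=["rankgroup", "join_time"], how="left", indicator=True)

# Share of taxi rows that found a matching weather row.
match_rate = (joined["_merge"] == "both").mean()
# Share of rows with a usable temperature (a matched row can still hold NaN).
temperature_fill = joined["temperature"].notna().mean()
print(match_rate, temperature_fill)
```

The two numbers can differ on real data, since a weather station may report a row for the hour but leave individual fields such as `snowDepth` empty.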


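Check the join success rate. In the `info()` output, the non-null counts of the weather columns show how many of the 200 taxi rows received a weather value.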

In [11]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 0 to 199
Data columns (total 27 columns):
vendorID                200 non-null int32
lpepPickupDatetime      200 non-null datetime64[ns]
lpepDropoffDatetime     200 non-null datetime64[ns]
passengerCount          200 non-null int32
tripDistance            200 non-null float64
puLocationId            0 non-null object
doLocationId            0 non-null object
pickupLongitude         200 non-null float64
pickupLatitude          200 non-null float64
dropoffLongitude        200 non-null float64
dropoffLatitude         200 non-null float64
rateCodeID              200 non-null int32
storeAndFwdFlag         200 non-null object
paymentType             200 non-null int32
fareAmount              200 non-null float64
extra                   200 non-null float64
mtaTax                  200 non-null float64
improvementSurcharge    200 non-null object
tipAmount               200 non-null float64
tollsAmount             200 non-null float