Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Tutorial: Load TAXI data and enrich it with Weather data in Pandas DataFrame

Install the `azureml-contrib-opendatasets` package.

In [1]:
!pip uninstall -y azureml-contrib-opendatasets
!pip install azureml-contrib-opendatasets

Uninstalling azureml-contrib-opendatasets-1.0.30:
  Successfully uninstalled azureml-contrib-opendatasets-1.0.30
Collecting azureml-contrib-opendatasets
  Using cached https://files.pythonhosted.org/packages/64/51/4d3de57cf210941346d907584e0e6e56780067bc3555250b1fe62c2285f7/azureml_contrib_opendatasets-1.0.30-py3-none-any.whl
Installing collected packages: azureml-contrib-opendatasets
Successfully installed azureml-contrib-opendatasets-1.0.30


Begin by creating a dataframe to hold the taxi data. In a non-Spark environment, Open Datasets lets certain classes download only one month of data at a time, which avoids a MemoryError on large datasets. To collect six months of taxi data, fetch one month per iteration and randomly sample 2000 records from that month before appending them to green_taxi_df, so the dataframe stays small.

In [2]:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from azureml.contrib.opendatasets import NycTlcGreen


green_taxi_df = pd.DataFrame([])
start = datetime.strptime("1/1/2016", "%m/%d/%Y")
end = datetime.strptime("1/31/2016", "%m/%d/%Y")

for sample_month in range(6):
    temp_df_green = NycTlcGreen(
        start + relativedelta(months=sample_month),
        end + relativedelta(months=sample_month)).to_pandas_dataframe()
    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions.
    green_taxi_df = pd.concat([green_taxi_df, temp_df_green.sample(2000)])

ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=1/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=1/part-00119-tid-6037743401120983271-619c4849-c957-4290-a1b8-66832cb385b6-12538.c000.snappy.parquet under container nyctlc
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=14548.1 [ms]
ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=2/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=2/part-00060-tid-6037743401120983271-619c4849-c957-4290-a1b8-66832cb385b6-12479.c000.snappy.parquet under container nyctlc
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=15113.93 [ms]
ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=3/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=3/part-00196-tid-603774
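
The loop above relies on `relativedelta` clamping January 31 to the last day of shorter months. As a minimal sketch of the same month-by-month windows using only the standard library, the helper below computes each month's first and last day explicitly. `month_windows` is a hypothetical name introduced for illustration and is not part of Open Datasets.

```python
import calendar
from datetime import datetime


def month_windows(year, first_month, count):
    """Return (start, end) datetimes for `count` consecutive calendar months.

    Hypothetical helper for illustration; not part of Open Datasets.
    """
    windows = []
    for offset in range(count):
        # Roll over into the following year once the month passes December.
        month_index = first_month - 1 + offset
        y = year + month_index // 12
        m = month_index % 12 + 1
        last_day = calendar.monthrange(y, m)[1]
        windows.append((datetime(y, m, 1), datetime(y, m, last_day)))
    return windows


windows = month_windows(2016, 1, 6)
for start, end in windows:
    print(start.date(), "to", end.date())
```

Each pair could then be passed as the start and end of one monthly download, in place of the `start + relativedelta(...)` and `end + relativedelta(...)` expressions.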

Save a copy of the original column names as raw_columns; the last step uses it to select the final columns.

In [3]:
raw_columns = list(green_taxi_df.columns)

Get mean values of pickupLatitude and pickupLongitude.

In [4]:
info = green_taxi_df.describe()
info['pickupLatitude']['mean'], info['pickupLongitude']['mean']

(40.64850642585754, -73.7586577612559)

Drop the rows in which lpepPickupDatetime, pickupLatitude, and pickupLongitude are all NaN. This removes, in particular, the first row, in which every column is NaN.

In [5]:
green_taxi_df = green_taxi_df.dropna(how='all', subset=['lpepPickupDatetime', 'pickupLatitude', 'pickupLongitude'])
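
To make the effect of `how='all'` with `subset` concrete, here is a small sketch on made-up rows (the values are invented for illustration, not taken from the taxi data). A row is dropped only when every column listed in `subset` is missing, whereas `how='any'` would drop a row as soon as one of them is missing.

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame({
    "lpepPickupDatetime": pd.to_datetime(
        ["2016-01-05 19:55:53", None, "2016-01-09 19:22:57"]),
    "pickupLatitude": [40.72, nan, nan],
    "pickupLongitude": [-73.95, nan, -73.99],
    "fareAmount": [3.5, 7.0, 22.0],
})
key_columns = ["lpepPickupDatetime", "pickupLatitude", "pickupLongitude"]

# Row 1 is missing all three key columns, so how="all" removes only that row.
dropped_all = toy.dropna(how="all", subset=key_columns)

# Row 2 is missing just the latitude, so how="any" removes it as well.
dropped_any = toy.dropna(how="any", subset=key_columns)

print(len(dropped_all), len(dropped_any))
```

Note that `fareAmount` is ignored by both calls because it is not listed in `subset`.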

Set every pickupLatitude and pickupLongitude to the mean computed above, which serves as a single central location for the city.

In [6]:
# Assign the mean coordinates to every row; a scalar broadcasts across the whole column.
green_taxi_df['pickupLatitude'] = info['pickupLatitude']['mean']
green_taxi_df['pickupLongitude'] = info['pickupLongitude']['mean']
green_taxi_df.head(5)

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,paymentType,fareAmount,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType
885726,1,2016-01-26 05:53:46,2016-01-26 06:06:40,1,5.0,,,-73.758658,40.648506,-73.931755,...,2,16.0,0.5,0.5,0.3,0.0,0.0,,17.3,1.0
252487,1,2016-01-23 13:15:53,2016-01-23 13:24:05,1,1.3,,,-73.758658,40.648506,-73.936607,...,2,7.0,0.0,0.5,0.3,0.0,0.0,,7.8,1.0
99655,2,2016-01-20 08:49:42,2016-01-20 09:11:37,1,4.53,,,-73.758658,40.648506,-73.961166,...,2,18.5,0.0,0.5,0.3,0.0,0.0,,19.3,1.0
453287,1,2016-01-09 19:22:57,2016-01-09 19:48:51,1,6.1,,,-73.758658,40.648506,-73.992134,...,1,22.0,0.5,0.5,0.3,4.65,0.0,,27.95,1.0
1395372,2,2016-01-05 19:55:53,2016-01-05 19:57:46,5,0.49,,,-73.758658,40.648506,-73.976654,...,2,3.5,1.0,0.5,0.3,0.0,0.0,,5.3,1.0


The original index can cause the initialization of LocationTimeCustomerData below to fail. As a workaround, add a monotonically increasing id column and use it as the index.

In [7]:
green_taxi_df['idx'] = list(range(len(green_taxi_df.index)))
green_taxi_df_idx = green_taxi_df.set_index('idx')
green_taxi_df_idx.head(5)

idx,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,paymentType,fareAmount,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType
0,1,2016-01-26 05:53:46,2016-01-26 06:06:40,1,5.0,,,-73.758658,40.648506,-73.931755,...,2,16.0,0.5,0.5,0.3,0.0,0.0,,17.3,1.0
1,1,2016-01-23 13:15:53,2016-01-23 13:24:05,1,1.3,,,-73.758658,40.648506,-73.936607,...,2,7.0,0.0,0.5,0.3,0.0,0.0,,7.8,1.0
2,2,2016-01-20 08:49:42,2016-01-20 09:11:37,1,4.53,,,-73.758658,40.648506,-73.961166,...,2,18.5,0.0,0.5,0.3,0.0,0.0,,19.3,1.0
3,1,2016-01-09 19:22:57,2016-01-09 19:48:51,1,6.1,,,-73.758658,40.648506,-73.992134,...,1,22.0,0.5,0.5,0.3,4.65,0.0,,27.95,1.0
4,2,2016-01-05 19:55:53,2016-01-05 19:57:46,5,0.49,,,-73.758658,40.648506,-73.976654,...,2,3.5,1.0,0.5,0.3,0.0,0.0,,5.3,1.0
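
The source does not say why the original index fails. One plausible cause, stated here as an assumption, is that each monthly sample keeps its own row labels, so the combined frame can contain repeated labels. The sketch below uses invented fares to show that situation and an alternative way to renumber the rows with `reset_index(drop=True)`.

```python
import pandas as pd

# Two monthly samples whose row labels happen to overlap (252487 appears twice).
january = pd.DataFrame({"fareAmount": [16.0, 7.0]}, index=[885726, 252487])
february = pd.DataFrame({"fareAmount": [18.5, 22.0]}, index=[252487, 99655])

combined = pd.concat([january, february])
print(combined.index.is_unique)  # labels are not unique after concatenation

# Replace the labels with 0..n-1 and name the index, mirroring the 'idx' workaround.
renumbered = combined.reset_index(drop=True)
renumbered.index.name = "idx"
print(renumbered.index.is_unique)
```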


Initialize LocationTimeCustomerData with the pandas dataframe green_taxi_df_idx, naming the latitude/longitude columns and the pickup time column.

In [8]:
# This is a contrib package in preview. The package name is subject to change.

from azureml.contrib.opendatasets.accessories.location_data import LatLongColumn
from azureml.contrib.opendatasets.accessories.location_time_customer_data \
    import LocationTimeCustomerData
from azureml.contrib.opendatasets import NoaaIsdWeather


green_taxi = LocationTimeCustomerData(
    green_taxi_df_idx,
    LatLongColumn('pickupLatitude', 'pickupLongitude'),
    'lpepPickupDatetime')

Define the PandasDataLoadLimitToMonths class to load the last N months of a given date range.

Note: this suits a machine with plenty of memory. Because the data is large, expect a long response time: loading six months of data takes almost 10 minutes.

In [9]:
from azure.storage.blob import BlockBlobService
from azureml.contrib.opendatasets._utils.time_utils import day_range, month_range
from azureml.contrib.opendatasets.dataaccess.pandas_data_load_limit import PandasDataLoadLimitNone


class PandasDataLoadLimitToMonths(PandasDataLoadLimitNone):
    def __init__(
            self,
            start_date,
            end_date,
            n_months,
            path_pattern='/year=%d/month=%d/'):
        self.start_date = start_date
        self.end_date = end_date
        self.n_months = n_months
        self.path_pattern = path_pattern
        super(PandasDataLoadLimitToMonths, self).__init__()

    def get_target_blob_paths(
            self,
            blob_service: BlockBlobService,
            blob_container_name: str,
            blob_relative_path: str):
        self._match_paths = []
        for current_month in month_range(self.start_date, self.end_date):
            self._match_paths.append(self.path_pattern % (current_month.year, current_month.month))

        if len(self._match_paths) > 1:
            print('We are taking the latest n months: %s' % (self._match_paths[-1]))
            self._match_paths = self._match_paths[-self.n_months:]

        print('Target paths: %s' % (self._match_paths))
        return super(PandasDataLoadLimitToMonths, self).get_target_blob_paths(
            blob_service=blob_service,
            blob_container_name=blob_container_name,
            blob_relative_path=blob_relative_path)
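
The path selection inside `get_target_blob_paths()` can be sketched without any Azure dependency. The function below is a hypothetical stand-in: it assumes, based on the logged target paths, that `month_range` yields one entry per calendar month from the start date through the end date, and it then keeps the last `n_months` entries.

```python
from datetime import datetime


def last_n_month_paths(start_date, end_date, n_months,
                       path_pattern="/year=%d/month=%d/"):
    """Build one partition path per month in the range and keep the last n.

    Hypothetical illustration of the path logic only; it does not call
    any Open Datasets or Azure Storage API.
    """
    paths = []
    year, month = start_date.year, start_date.month
    while (year, month) <= (end_date.year, end_date.month):
        paths.append(path_pattern % (year, month))
        # Advance by one month, rolling over after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths[-n_months:]


paths = last_n_month_paths(
    datetime(2016, 1, 1), datetime(2016, 6, 30, 23, 59), 3)
print(paths)
# ['/year=2016/month=4/', '/year=2016/month=5/', '/year=2016/month=6/']
```

One design detail worth knowing: a slice such as `paths[-n_months:]` returns the whole list when `n_months` is 0, so a caller that might pass 0 would need to handle that case separately.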

Define the NoaaIsdWeatherForMonths class, which inherits from NoaaIsdWeather. Overriding get_pandas_limit() lets us balance data load performance against the amount of data loaded.

In [10]:
from azureml.contrib.opendatasets import NoaaIsdWeather
from datetime import datetime
from dateutil import parser
from typing import List, Optional

class NoaaIsdWeatherForMonths(NoaaIsdWeather):
    _default_start_date = parser.parse('2008-01-01')
    _default_end_date = datetime.today()

    def __init__(
            self,
            start_date: datetime = _default_start_date,
            end_date: datetime = _default_end_date,
            n_months: int = 6,
            cols: Optional[List[str]] = None,
            enable_telemetry: bool = False):
        self.n_months = n_months
        super(NoaaIsdWeatherForMonths, self).__init__(
            start_date=start_date, end_date=end_date, cols=cols, enable_telemetry=enable_telemetry)
        
    def get_pandas_limit(self):
        return PandasDataLoadLimitToMonths(self.start_date, self.end_date, self.n_months)

Initialize the NoaaIsdWeatherForMonths class, get an enricher from it, and enrich the taxi data without aggregation.

In [11]:
weather = NoaaIsdWeatherForMonths(
    cols=["temperature", "precipTime", "precipDepth", "snowDepth"],
    start_date=datetime(2016, 1, 1, 0, 0),
    end_date=datetime(2016, 6, 30, 23, 59),
    n_months=6)
weather_enricher = weather.get_enricher()
new_green_taxi, processed_weather = weather_enricher.enrich_customer_data_no_agg(
    customer_data_object=green_taxi,
    location_match_granularity=1,
    time_round_granularity='day')

We are taking the latest n months: /year=2016/month=6/
Target paths: ['/year=2016/month=1/', '/year=2016/month=2/', '/year=2016/month=3/', '/year=2016/month=4/', '/year=2016/month=5/', '/year=2016/month=6/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2016/month=1/part-00005-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-110.c000.snappy.parquet under container isdweatherdatacontainer
Reading ISDWeather/year=2016/month=2/part-00011-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-116.c000.snappy.parquet under container isdweatherdatacontainer
Reading ISDWeather/year=2016/month=3/part-00004-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-109.c000.snappy.parquet under container isdweatherdatacontainer
Reading ISDWeather/year=2016/month=4/part-00008-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-113.c000.snappy.parquet under container isdweatherdatacontainer
Reading ISDWeather/year=2016/mont

Preview the pandas dataframe new_green_taxi.data

In [12]:
new_green_taxi.data.head(3)

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,customer_rankgroup171ga,customer_join_timey0boj
0,1,2016-01-26 05:53:46,2016-01-26 06:06:40,1,5.0,,,-73.758658,40.648506,-73.931755,...,0.5,0.5,0.3,0.0,0.0,,17.3,1.0,1,2016-01-26
1,1,2016-01-23 13:15:53,2016-01-23 13:24:05,1,1.3,,,-73.758658,40.648506,-73.936607,...,0.0,0.5,0.3,0.0,0.0,,7.8,1.0,1,2016-01-23
2,2,2016-01-20 08:49:42,2016-01-20 09:11:37,1,4.53,,,-73.758658,40.648506,-73.961166,...,0.0,0.5,0.3,0.0,0.0,,19.3,1.0,1,2016-01-20
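
In the preview above, the `customer_join_time…` column holds each pickup time reduced to its calendar day, including the 13:15 pickup that stays on the same date. That is consistent with flooring to the day. How the enricher computes it internally is not shown in the source, so the following is only a sketch of equivalent rounding in plain pandas.

```python
import pandas as pd

pickups = pd.Series(pd.to_datetime([
    "2016-01-26 05:53:46",
    "2016-01-23 13:15:53",
    "2016-01-20 08:49:42",
]))

# Floor each timestamp to midnight of the same day (assumed equivalent of 'day' granularity).
join_day = pickups.dt.floor("D")
print(join_day)
```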


Define a dict `aggregations` that specifies how to aggregate each weather field at the day level, matching the `time_round_granularity='day'` setting used above. For `snowDepth` and `temperature` we'll take the mean, and for `precipTime` and `precipDepth` we'll take the daily maximum. Use the groupby() function along with the aggregations to group the data.

In [13]:
aggregations = {
    "snowDepth": "mean",
    "precipTime": "max",
    "temperature": "mean",
    "precipDepth": "max"}

The keys (`public_rankgroup`, `public_join_time`, `customer_rankgroup`, `customer_join_time`) used by groupby() and later by merge() carry generated suffixes, so under the current design they must be looked up from the enricher objects and the column names, as done here.

In [14]:
public_rankgroup = processed_weather.id

public_join_time = [
    s for s in list(processed_weather.data.columns)
    if s.startswith('ds_join_time')][0]

customer_rankgroup = weather_enricher.location_selector.customer_rankgroup

customer_join_time = [
    s for s in list(new_green_taxi.data.columns)
    if type(s) is str and s.startswith('customer_join_time')][0]

weather_df_grouped = processed_weather.data.groupby(by=[public_rankgroup, public_join_time]).agg(aggregations)
weather_df_grouped.head(3)

public_rankgrouphkmr7,ds_join_timeignr6,snowDepth,precipTime,temperature,precipDepth
1,2016-01-01,0.0,24.0,5.590625,15.0
1,2016-01-02,0.0,24.0,3.0875,0.0
1,2016-01-03,0.0,24.0,4.66875,0.0
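
The grouping step passes a column-to-function mapping to `agg()`. As a minimal sketch on invented readings (the column names `station_group` and `join_day` are placeholders, not the generated names used by the enricher), each listed column is reduced with its own function per group:

```python
import pandas as pd

readings = pd.DataFrame({
    "station_group": [1, 1, 1, 1],
    "join_day": pd.to_datetime(
        ["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"]),
    "temperature": [4.0, 6.0, 2.0, 4.0],
    "precipDepth": [0.0, 15.0, 0.0, 0.0],
})

# Mean temperature and maximum precipitation depth per (group, day).
daily = readings.groupby(["station_group", "join_day"]).agg(
    {"temperature": "mean", "precipDepth": "max"})
print(daily)
```

The result is indexed by the two grouping keys, which is why the grouped weather frame above shows them as index levels instead of ordinary columns.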


Merge the taxi data with the grouped weather data, keep the original taxi columns plus the four weather columns, and preview the result.

In [15]:
joined_dataset = new_green_taxi.data.merge(
    weather_df_grouped,
    left_on=[customer_rankgroup, customer_join_time],
    right_on=[public_rankgroup, public_join_time],
    how='left')

final_df = joined_dataset[raw_columns + [
    "temperature", "precipTime", "precipDepth", "snowDepth"]]
final_df.head(5)

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,temperature,precipTime,precipDepth,snowDepth
0,1,2016-01-26 05:53:46,2016-01-26 06:06:40,1,5.0,,,-73.758658,40.648506,-73.931755,...,0.3,0.0,0.0,,17.3,1.0,3.7375,24.0,0.0,55.222222
1,1,2016-01-23 13:15:53,2016-01-23 13:24:05,1,1.3,,,-73.758658,40.648506,-73.936607,...,0.3,0.0,0.0,,7.8,1.0,-1.819149,24.0,70.0,29.409091
2,2,2016-01-20 08:49:42,2016-01-20 09:11:37,1,4.53,,,-73.758658,40.648506,-73.961166,...,0.3,0.0,0.0,,19.3,1.0,0.240625,24.0,0.0,0.0
3,1,2016-01-09 19:22:57,2016-01-09 19:48:51,1,6.1,,,-73.758658,40.648506,-73.992134,...,0.3,4.65,0.0,,27.95,1.0,7.456098,24.0,3.0,0.0
4,2,2016-01-05 19:55:53,2016-01-05 19:57:46,5,0.49,,,-73.758658,40.648506,-73.976654,...,0.3,0.0,0.0,,5.3,1.0,-7.15,24.0,0.0,0.0
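
The merge above joins ordinary columns on the left against index levels on the right, which pandas allows by naming the levels in `right_on`. A small sketch with invented data and placeholder key names shows the mechanics, and how a left join leaves NaN where no weather row matches:

```python
import pandas as pd

trips = pd.DataFrame({
    "fareAmount": [16.0, 7.0, 3.5],
    "trip_group": [1, 1, 1],
    "trip_day": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-09"]),
})

# Weather keyed by a two-level index, like the output of groupby().agg().
weather_daily = pd.DataFrame({
    "station_group": [1, 1],
    "weather_day": pd.to_datetime(["2016-01-01", "2016-01-02"]),
    "temperature": [5.59, 3.09],
}).set_index(["station_group", "weather_day"])

joined = trips.merge(
    weather_daily,
    left_on=["trip_group", "trip_day"],
    right_on=["station_group", "weather_day"],
    how="left")

# Share of trips that found a weather row (the third trip has no match).
match_rate = joined["temperature"].notna().mean()
print(match_rate)
```

Because `how="left"` keeps every trip, the number of rows is unchanged and unmatched trips simply carry NaN in the weather columns.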


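Check the join success rate: in the info() output, the non-null counts of the weather columns show how many rows found a weather match.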

In [16]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12000 entries, 0 to 11999
Data columns (total 27 columns):
vendorID                12000 non-null int32
lpepPickupDatetime      12000 non-null datetime64[ns]
lpepDropoffDatetime     12000 non-null datetime64[ns]
passengerCount          12000 non-null int32
tripDistance            12000 non-null float64
puLocationId            0 non-null object
doLocationId            0 non-null object
pickupLongitude         12000 non-null float64
pickupLatitude          12000 non-null float64
dropoffLongitude        12000 non-null float64
dropoffLatitude         12000 non-null float64
rateCodeID              12000 non-null int32
storeAndFwdFlag         12000 non-null object
paymentType             12000 non-null int32
fareAmount              12000 non-null float64
extra                   12000 non-null float64
mtaTax                  12000 non-null float64
improvementSurcharge    12000 non-null object
tipAmount               12000 non-null float64
toll