Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Tutorial: Load taxi data and enrich it with weather data

Install the azureml-contrib-opendatasets package.

In [1]:
!pip uninstall -y azureml-contrib-opendatasets
!pip install azureml-contrib-opendatasets

Uninstalling azureml-contrib-opendatasets-1.0.30:
  Successfully uninstalled azureml-contrib-opendatasets-1.0.30
Collecting azureml-contrib-opendatasets
  Using cached https://files.pythonhosted.org/packages/64/51/4d3de57cf210941346d907584e0e6e56780067bc3555250b1fe62c2285f7/azureml_contrib_opendatasets-1.0.30-py3-none-any.whl
Installing collected packages: azureml-contrib-opendatasets
Successfully installed azureml-contrib-opendatasets-1.0.30


Begin by creating a dataframe to hold the taxi data. In a non-Spark environment, some Open Datasets classes only allow downloading one month of data at a time, to avoid a MemoryError with large datasets. To download five months of taxi data, fetch one month at a time and, before appending it to green_taxi_df, randomly sample 2000 records from that month to avoid bloating the dataframe.

In [2]:
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from azureml.contrib.opendatasets import NycTlcGreen


green_taxi_df = pd.DataFrame([4])
start = datetime.strptime("1/1/2016", "%m/%d/%Y")
end = datetime.strptime("1/31/2016", "%m/%d/%Y")

for sample_month in range(5):
    temp_df_green = NycTlcGreen(
        start + relativedelta(months=sample_month),
        end + relativedelta(months=sample_month)).to_pandas_dataframe()
    green_taxi_df = green_taxi_df.append(temp_df_green.sample(2000))

ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=1/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=1/part-00119-tid-6037743401120983271-619c4849-c957-4290-a1b8-66832cb385b6-12538.c000.snappy.parquet under container nyctlc
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=16476.84 [ms]




ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=2/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=2/part-00060-tid-6037743401120983271-619c4849-c957-4290-a1b8-66832cb385b6-12479.c000.snappy.parquet under container nyctlc
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=16129.08 [ms]


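FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.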


ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=3/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=3/part-00196-tid-6037743401120983271-619c4849-c957-4290-a1b8-66832cb385b6-12615.c000.snappy.parquet under container nyctlc
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=13349.99 [ms]
ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=4/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=4/part-00121-tid-6037743401120983271-619c4849-c957-4290-a1b8-66832cb385b6-12540.c000.snappy.parquet under container nyctlc
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=16494.78 [ms]
ActivityStarted, to_pandas_dataframe
Target paths: ['/puYear=2016/puMonth=5/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading green/puYear=2016/puMonth=5/part-00044-tid-60377
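
The loop above shifts a January window forward with relativedelta, which clamps the shifted end date to the last valid day of each month (for example, January 31 plus one month becomes February 29 in 2016). As a minimal sketch of the same month windows using only the standard library, the hypothetical helper `month_window` below (an assumption for illustration, not part of Open Datasets) makes the boundaries explicit:

```python
import calendar
from datetime import datetime


def month_window(year, month):
    """Return (first_day, last_day) datetimes for one calendar month.

    Hypothetical helper for illustration; it is not part of Open Datasets.
    """
    # monthrange returns (weekday of the first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    return datetime(year, month, 1), datetime(year, month, last_day)


# The five windows fetched by the loop: January through May 2016.
windows = [month_window(2016, m) for m in range(1, 6)]
for first, last in windows:
    print(first.date(), "to", last.date())
```

Each tuple could be passed as the start and end date of one monthly download.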

Save the original column names in raw_columns. The list is used in the last step to select the final columns.

In [3]:
raw_columns = list(green_taxi_df.columns)

Get the mean values of pickupLatitude and pickupLongitude.

In [4]:
info = green_taxi_df.describe()
info['pickupLatitude']['mean'], info['pickupLongitude']['mean']

(40.68688348197937, -73.82572079849243)

Drop the rows in which the pickup datetime, latitude, and longitude are all NaN. In particular, this removes the placeholder first row created when green_taxi_df was initialized.

In [5]:
green_taxi_df = green_taxi_df.dropna(how='all', subset=['lpepPickupDatetime', 'pickupLatitude', 'pickupLongitude'])
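
To illustrate what `how='all'` combined with `subset` does, here is a small sketch on a toy frame. The values are made up and stand in for the taxi data; only the pandas `dropna` call mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 mimics a placeholder row with no trip fields at all,
# row 1 is missing only its longitude, row 2 is complete.
toy = pd.DataFrame({
    "lpepPickupDatetime": [pd.NaT, pd.Timestamp("2016-01-19 14:05:49"),
                           pd.Timestamp("2016-01-27 11:43:01")],
    "pickupLatitude": [np.nan, 40.69, 40.70],
    "pickupLongitude": [np.nan, np.nan, -73.83],
})

subset = ["lpepPickupDatetime", "pickupLatitude", "pickupLongitude"]

# how='all' drops a row only when every column in `subset` is missing.
kept_all = toy.dropna(how="all", subset=subset)
# how='any' would also drop the row that lacks just one value.
kept_any = toy.dropna(how="any", subset=subset)

print(len(kept_all), len(kept_any))
```

With `how='all'`, a trip that is missing only one coordinate is kept, which matters here because the coordinates are overwritten in a later step anyway.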

Set every pickupLatitude and pickupLongitude to the mean pickup location computed above, which serves as an approximate center of the city.

In [6]:
# Overwrite every pickup coordinate with the mean location stored in info.
# Assigning a scalar broadcasts it to all rows.
green_taxi_df['pickupLatitude'] = info['pickupLatitude']['mean']
green_taxi_df['pickupLongitude'] = info['pickupLongitude']['mean']
green_taxi_df.head(5)

Unnamed: 0,0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,...,paymentType,fareAmount,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType
67793,,2.0,2016-01-19 14:05:49,2016-01-19 14:25:34,1.0,8.07,,,-73.825721,40.686883,...,2.0,24.5,0.0,0.5,0.3,0.0,0.0,,25.3,1.0
940818,,2.0,2016-01-27 11:43:01,2016-01-27 12:23:13,1.0,7.2,,,-73.825721,40.686883,...,2.0,30.5,0.0,0.5,0.3,0.0,0.0,,31.3,1.0
1343631,,2.0,2016-01-04 17:12:26,2016-01-04 17:22:05,1.0,1.38,,,-73.825721,40.686883,...,2.0,8.0,1.0,0.5,0.3,0.0,0.0,,9.8,1.0
373214,,1.0,2016-01-08 10:53:24,2016-01-08 11:11:32,1.0,3.8,,,-73.825721,40.686883,...,2.0,15.0,0.0,0.5,0.3,0.0,0.0,,15.8,1.0
1228237,,2.0,2016-01-02 00:43:42,2016-01-02 00:53:02,1.0,1.85,,,-73.825721,40.686883,...,1.0,9.0,0.5,0.5,0.3,0.0,0.0,,10.3,1.0


The original index can cause the initialization of the LocationTimeCustomerData class below to fail. As a workaround, add a monotonically increasing idx column and use it as the index.

In [7]:
green_taxi_df['idx'] = list(range(len(green_taxi_df.index)))
green_taxi_df_idx = green_taxi_df.set_index('idx')
green_taxi_df_idx.head(5)

idx,0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,...,paymentType,fareAmount,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType
0,,2.0,2016-01-19 14:05:49,2016-01-19 14:25:34,1.0,8.07,,,-73.825721,40.686883,...,2.0,24.5,0.0,0.5,0.3,0.0,0.0,,25.3,1.0
1,,2.0,2016-01-27 11:43:01,2016-01-27 12:23:13,1.0,7.2,,,-73.825721,40.686883,...,2.0,30.5,0.0,0.5,0.3,0.0,0.0,,31.3,1.0
2,,2.0,2016-01-04 17:12:26,2016-01-04 17:22:05,1.0,1.38,,,-73.825721,40.686883,...,2.0,8.0,1.0,0.5,0.3,0.0,0.0,,9.8,1.0
3,,1.0,2016-01-08 10:53:24,2016-01-08 11:11:32,1.0,3.8,,,-73.825721,40.686883,...,2.0,15.0,0.0,0.5,0.3,0.0,0.0,,15.8,1.0
4,,2.0,2016-01-02 00:43:42,2016-01-02 00:53:02,1.0,1.85,,,-73.825721,40.686883,...,1.0,9.0,0.5,0.5,0.3,0.0,0.0,,10.3,1.0
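
The source does not say why the original index is a problem. One plausible reason (an assumption) is that each monthly sample keeps its own row labels, so labels can repeat once the months are combined. A toy sketch of that situation and of the same workaround:

```python
import pandas as pd

# Two toy "monthly" frames whose default indexes both start at 0.
jan = pd.DataFrame({"fareAmount": [24.5, 30.5]})
feb = pd.DataFrame({"fareAmount": [8.0, 15.0]})

# Concatenation keeps each frame's own labels, so 0 and 1 appear twice.
combined = pd.concat([jan, feb])
print(combined.index.is_unique)  # False

# Same workaround as above: add a 0..n-1 column and make it the index.
combined["idx"] = list(range(len(combined.index)))
combined_idx = combined.set_index("idx")
print(combined_idx.index.is_unique)  # True
```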


Initialize a LocationTimeCustomerData object, green_taxi, from the pandas dataframe green_taxi_df_idx.

In [8]:
# This is a contrib package in preview. The package name is subject to change.

from azureml.contrib.opendatasets.accessories.location_data import LatLongColumn
from azureml.contrib.opendatasets.accessories.location_time_customer_data \
    import LocationTimeCustomerData
from azureml.contrib.opendatasets import NoaaIsdWeather


green_taxi = LocationTimeCustomerData(
    green_taxi_df_idx,
    LatLongColumn('pickupLatitude', 'pickupLongitude'),
    'lpepPickupDatetime')

Define the PandasDataLoadLimitToMonths class, which loads only the last N months of a given date range.

In [9]:
from azure.storage.blob import BlockBlobService
from azureml.contrib.opendatasets._utils.time_utils import day_range, month_range
from azureml.contrib.opendatasets.dataaccess.pandas_data_load_limit import PandasDataLoadLimitNone


class PandasDataLoadLimitToMonths(PandasDataLoadLimitNone):
    def __init__(
            self,
            start_date,
            end_date,
            n_months,
            path_pattern='/year=%d/month=%d/'):
        self.start_date = start_date
        self.end_date = end_date
        self.n_months = n_months
        self.path_pattern = path_pattern
        super(PandasDataLoadLimitToMonths, self).__init__()

    def get_target_blob_paths(
            self,
            blob_service: BlockBlobService,
            blob_container_name: str,
            blob_relative_path: str):
        self._match_paths = []
        for current_month in month_range(self.start_date, self.end_date):
            self._match_paths.append(self.path_pattern % (current_month.year, current_month.month))

        if len(self._match_paths) > 1:
            print('We are taking the latest n months: %s' % (self._match_paths[-1]))
            self._match_paths = self._match_paths[-self.n_months:]

        print('Target paths: %s' % (self._match_paths))
        return super(PandasDataLoadLimitToMonths, self).get_target_blob_paths(
            blob_service=blob_service,
            blob_container_name=blob_container_name,
            blob_relative_path=blob_relative_path)
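
The path selection in `get_target_blob_paths` can be sketched without the Azure dependencies. The helper `month_paths` below is a hypothetical stand-in for the `month_range` loop (an assumption for illustration, not the library's code); the slice is the same one the class applies.

```python
from datetime import datetime


def month_paths(start_date, end_date, path_pattern="/year=%d/month=%d/"):
    """List one partition path per calendar month in [start_date, end_date].

    Hypothetical stand-in for the month_range-based loop; illustration only.
    """
    paths = []
    year, month = start_date.year, start_date.month
    while (year, month) <= (end_date.year, end_date.month):
        paths.append(path_pattern % (year, month))
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


all_paths = month_paths(datetime(2016, 2, 1), datetime(2016, 5, 31, 23, 59))
latest = all_paths[-3:]  # keep only the latest n_months=3 partitions
print(latest)
```

One property of this slicing worth keeping in mind: a value of 0 would not select nothing, because `[-0:]` is the whole list.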

Define the NoaaIsdWeatherForMonths class, which inherits from NoaaIsdWeather.
Overriding the get_pandas_limit() method lets us balance data load performance against the amount of data loaded.

In [10]:
from azureml.contrib.opendatasets import NoaaIsdWeather
from datetime import datetime
from dateutil import parser
from typing import List, Optional

class NoaaIsdWeatherForMonths(NoaaIsdWeather):
    _default_start_date = parser.parse('2008-01-01')
    _default_end_date = datetime.today()

    def __init__(
            self,
            start_date: datetime = _default_start_date,
            end_date: datetime = _default_end_date,
            n_months: int = 6,
            cols: Optional[List[str]] = None,
            enable_telemetry: bool = False):
        self.n_months = n_months
        super(NoaaIsdWeatherForMonths, self).__init__(
            start_date=start_date, end_date=end_date, cols=cols, enable_telemetry=enable_telemetry)
        
    def get_pandas_limit(self):
        return PandasDataLoadLimitToMonths(self.start_date, self.end_date, self.n_months)
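
The pattern used here is a subclass that overrides a single hook method while the base class keeps driving the load. As a rough, self-contained sketch of that idea, the toy classes below (`ToyDataset`, `ToyDatasetForMonths`, and `get_limit` are all hypothetical names, not Open Datasets APIs) show how overriding only the hook changes what gets loaded:

```python
class ToyDataset:
    """Toy base class; to_paths() consults a hook to decide what to load."""

    def __init__(self, paths):
        self.paths = list(paths)

    def get_limit(self):
        # Default hook: no limit, load every partition.
        return len(self.paths)

    def to_paths(self):
        limit = self.get_limit()
        return self.paths[-limit:] if limit > 0 else []


class ToyDatasetForMonths(ToyDataset):
    """Subclass that overrides only the hook, like NoaaIsdWeatherForMonths."""

    def __init__(self, paths, n_months):
        self.n_months = n_months
        super().__init__(paths)

    def get_limit(self):
        return min(self.n_months, len(self.paths))


months = ["month=2", "month=3", "month=4", "month=5"]
print(ToyDataset(months).to_paths())
print(ToyDatasetForMonths(months, n_months=3).to_paths())
```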

Initialize the NoaaIsdWeatherForMonths class, get an enricher from it, and enrich the taxi data without aggregation.

In [11]:
weather = NoaaIsdWeatherForMonths(
    cols=["temperature", "precipTime", "precipDepth", "snowDepth"],
    start_date=datetime(2016, 2, 1, 0, 0),
    end_date=datetime(2016, 5, 31, 23, 59),
    n_months=3)
weather_enricher = weather.get_enricher()
new_green_taxi, processed_weather = weather_enricher.enrich_customer_data_no_agg(
    customer_data_object=green_taxi,
    location_match_granularity=1,
    time_round_granularity='hour')

We are taking the latest n months: /year=2016/month=5/
Target paths: ['/year=2016/month=3/', '/year=2016/month=4/', '/year=2016/month=5/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2016/month=3/part-00004-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-109.c000.snappy.parquet under container isdweatherdatacontainer
Reading ISDWeather/year=2016/month=4/part-00008-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-113.c000.snappy.parquet under container isdweatherdatacontainer
Reading ISDWeather/year=2016/month=5/part-00006-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-111.c000.snappy.parquet under container isdweatherdatacontainer
Done.


Preview the pandas dataframe new_green_taxi.data

In [12]:
new_green_taxi.data.head(3)

Unnamed: 0,0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,...,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,customer_rankgroup47p4i,customer_join_timei5wtk
0,,2.0,2016-01-19 14:05:49,2016-01-19 14:25:34,1.0,8.07,,,-73.825721,40.686883,...,0.0,0.5,0.3,0.0,0.0,,25.3,1.0,1,2016-01-19 14:00:00
1,,2.0,2016-01-27 11:43:01,2016-01-27 12:23:13,1.0,7.2,,,-73.825721,40.686883,...,0.0,0.5,0.3,0.0,0.0,,31.3,1.0,1,2016-01-27 12:00:00
2,,2.0,2016-01-04 17:12:26,2016-01-04 17:22:05,1.0,1.38,,,-73.825721,40.686883,...,1.0,0.5,0.3,0.0,0.0,,9.8,1.0,1,2016-01-04 17:00:00
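
Judging from the preview, the customer_join_time column looks like the pickup time rounded to the nearest hour (11:43:01 becomes 12:00:00, while 14:05:49 becomes 14:00:00). The enricher's implementation is not shown here, so the following is only a sketch of that rounding, with `round_to_hour` as a hypothetical helper:

```python
from datetime import datetime, timedelta


def round_to_hour(ts):
    """Round a datetime to the nearest hour (half-hours round up).

    Hypothetical helper; not the enricher's actual code.
    """
    floored = ts.replace(minute=0, second=0, microsecond=0)
    if ts - floored >= timedelta(minutes=30):
        return floored + timedelta(hours=1)
    return floored


for raw in ["2016-01-19 14:05:49", "2016-01-27 11:43:01", "2016-01-04 17:12:26"]:
    ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    print(raw, "->", round_to_hour(ts))
```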


Define a dict `aggregations` that specifies how to aggregate each field at an hourly level. For `snowDepth` and `temperature` we'll take the mean, and for `precipTime` and `precipDepth` we'll take the hourly maximum. Use the groupby() function along with the aggregations to group the data.

In [13]:
aggregations = {
    "snowDepth": "mean",
    "precipTime": "max",
    "temperature": "mean",
    "precipDepth": "max"}

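The key column names (`public_rankgroup`, `public_join_time`, `customer_rankgroup`, `customer_join_time`) used by groupby() and the later merge() carry generated suffixes (for example, customer_join_timei5wtk in the preview above), so they must be looked up at run time rather than hard-coded. This is a workaround for the current design.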

In [14]:
public_rankgroup = weather_enricher.location_selector.public_rankgroup

public_join_time = [
    s for s in list(processed_weather.data.columns)
    if s.startswith('ds_join_time')][0]

customer_rankgroup = weather_enricher.location_selector.customer_rankgroup

customer_join_time = [
    s for s in list(new_green_taxi.data.columns)
    if type(s) is str and s.startswith('customer_join_time')][0]

weather_df_grouped = processed_weather.data.groupby(by=[public_rankgroup, public_join_time]).agg(aggregations)
weather_df_grouped.head(3)

public_rankgroup38yuq,ds_join_timepk80f,snowDepth,precipTime,temperature,precipDepth
1,2016-03-01 04:00:00,,1.0,8.3,0.0
1,2016-03-01 05:00:00,0.0,24.0,6.1,15.0
1,2016-03-01 07:00:00,,1.0,8.3,0.0
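
To make the per-column aggregation concrete, here is a small sketch on made-up readings. The column names `station` and `hour` are placeholders for the generated rank-group and join-time keys, and `toy_aggregations` covers only two of the four fields.

```python
import pandas as pd

# Toy weather readings: two observations fall into the same station/hour.
readings = pd.DataFrame({
    "station": [1, 1, 1],
    "hour": pd.to_datetime(["2016-03-01 05:00", "2016-03-01 05:00",
                            "2016-03-01 07:00"]),
    "temperature": [6.0, 8.0, 8.3],
    "precipDepth": [0.0, 15.0, 0.0],
})

# Per-column aggregation: mean for temperature, maximum for precipDepth.
toy_aggregations = {"temperature": "mean", "precipDepth": "max"}
hourly = readings.groupby(by=["station", "hour"]).agg(toy_aggregations)
print(hourly)
```

The result has one row per (station, hour) pair, with the grouping keys as a two-level index, which is the same shape as weather_df_grouped.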


Left-join the enriched taxi data with the aggregated weather data, keep the original columns plus the four weather columns, and preview the result.

In [15]:
joined_dataset = new_green_taxi.data.merge(
    weather_df_grouped,
    left_on=[customer_rankgroup, customer_join_time],
    right_on=[public_rankgroup, public_join_time],
    how='left')

final_df = joined_dataset[raw_columns + [
    "temperature", "precipTime", "precipDepth", "snowDepth"]]
final_df.head(5)

Unnamed: 0,0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,...,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,temperature,precipTime,precipDepth,snowDepth
0,,2.0,2016-01-19 14:05:49,2016-01-19 14:25:34,1.0,8.07,,,-73.825721,40.686883,...,0.3,0.0,0.0,,25.3,1.0,,,,
1,,2.0,2016-01-27 11:43:01,2016-01-27 12:23:13,1.0,7.2,,,-73.825721,40.686883,...,0.3,0.0,0.0,,31.3,1.0,,,,
2,,2.0,2016-01-04 17:12:26,2016-01-04 17:22:05,1.0,1.38,,,-73.825721,40.686883,...,0.3,0.0,0.0,,9.8,1.0,,,,
3,,1.0,2016-01-08 10:53:24,2016-01-08 11:11:32,1.0,3.8,,,-73.825721,40.686883,...,0.3,0.0,0.0,,15.8,1.0,,,,
4,,2.0,2016-01-02 00:43:42,2016-01-02 00:53:02,1.0,1.85,,,-73.825721,40.686883,...,0.3,0.0,0.0,,10.3,1.0,,,,
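
The NaN weather values in this preview are consistent with `how='left'`: the five rows shown are January trips, while the weather load above targeted March through May only. A toy sketch of that behavior, using made-up frames and plain columns as join keys (a simplification of the index-level keys used above), also shows one way to compute a match rate:

```python
import pandas as pd

# Toy trips: one in January (no weather loaded) and one in March.
trips = pd.DataFrame({
    "group": [1, 1],
    "join_time": pd.to_datetime(["2016-01-19 14:00", "2016-03-01 05:00"]),
    "fareAmount": [24.5, 9.0],
})

# Toy hourly weather covering March only.
weather_toy = pd.DataFrame({
    "station": [1],
    "hour": pd.to_datetime(["2016-03-01 05:00"]),
    "temperature": [6.1],
})

# how='left' keeps every trip; trips without a weather match get NaN.
joined = trips.merge(
    weather_toy,
    left_on=["group", "join_time"],
    right_on=["station", "hour"],
    how="left")

# Share of trips that received a temperature value.
match_rate = joined["temperature"].notna().mean()
print(joined[["fareAmount", "temperature"]])
print(match_rate)
```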


Check the join success rate by comparing the non-null counts of the weather columns with the total number of rows.

In [16]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 28 columns):
0                       0 non-null float64
vendorID                10000 non-null float64
lpepPickupDatetime      10000 non-null datetime64[ns]
lpepDropoffDatetime     10000 non-null datetime64[ns]
passengerCount          10000 non-null float64
tripDistance            10000 non-null float64
puLocationId            0 non-null object
doLocationId            0 non-null object
pickupLongitude         10000 non-null float64
pickupLatitude          10000 non-null float64
dropoffLongitude        10000 non-null float64
dropoffLatitude         10000 non-null float64
rateCodeID              10000 non-null float64
storeAndFwdFlag         10000 non-null object
paymentType             10000 non-null float64
fareAmount              10000 non-null float64
extra                   10000 non-null float64
mtaTax                  10000 non-null float64
improvementSurcharge    10000 non-null object
t