Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Tutorial: Load demo data and enrich it with NOAA ISD Weather data

In this tutorial, you load the demo data (a parquet file in Azure Blob storage), check its schema, and enrich it with NOAA ISD Weather data.

Prerequisites:
> * pandas version must be 0.23.0 or above
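
The version prerequisite above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of the original tutorial: the helper names `version_tuple`, `meets_minimum`, and `installed_version` are hypothetical, and the comparison only looks at the leading numeric part of the version string.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Return the leading numeric part of a version string as a 3-tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    numbers = [int(part) for part in match.group(0).split(".")][:3]
    # Pad short versions such as "1.0" to three components.
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed: str, minimum: str = "0.23.0") -> bool:
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package: str = "pandas"):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("pandas")
if found is None:
    print("pandas is not installed")
else:
    print("pandas", found, "is new enough" if meets_minimum(found) else "is too old")
```

Comparing tuples of integers avoids the trap of comparing version strings lexically, where "0.9.0" would sort after "0.23.0".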

Learn how to:
> * Load the demo data from Azure Blob
> * Check the demo data schema
> * Initialize the NoaaIsdWeather class to load weather data
> * Enrich the demo data with weather data
> * Display the joined result and stats


## Install the azureml-contrib-opendatasets package

In [1]:
!pip uninstall -y azureml-contrib-opendatasets
!pip install azureml-contrib-opendatasets --index-url https://azuremlsdktestpypi.azureedge.net/sdk-release/Candidate/604C89A437BA41BD942B4F46D9A3591D

Uninstalling azureml-contrib-opendatasets-1.0.28:
  Successfully uninstalled azureml-contrib-opendatasets-1.0.28
Looking in indexes: https://azuremlsdktestpypi.azureedge.net/sdk-release/Candidate/604C89A437BA41BD942B4F46D9A3591D
Collecting azureml-contrib-opendatasets
  Using cached https://azuremlsdktestpypi.blob.core.windows.net/repo/sdk-release/Candidate/604C89A437BA41BD942B4F46D9A3591D/azureml_contrib_opendatasets-1.0.28-py3-none-any.whl?sv=2017-07-29&sr=b&sig=91J%2BQMAXMowy%2Ftt0%2B6QhdwMNTF6Ev%2FZw81KYeSq4tBE%3D&st=2019-04-17T20%3A43%3A38Z&se=2020-04-17T20%3A43%3A38Z&sp=rl
Installing collected packages: azureml-contrib-opendatasets
Successfully installed azureml-contrib-opendatasets-1.0.28


## Define a DemoData class to load the demo parquet file from Azure Blob
> <font color="green">
    To load your own parquet file, change the blob account, container, and path in the script.
    <br>(lines 7 to 9 of the cell below)
</font>

In [2]:
from azure.storage.blob import BlockBlobService
import pyarrow.parquet as pq
from io import BytesIO

class DemoData:
    def __init__(self):
        self.blob_account_name = "azureopendatastorage"
        self.blob_container_name = "tutorials"
        self.blob_relative_path = 'noaa_isd_weather/demo.parquet'

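    def to_pandas_dataframe(self):
        # BlockBlobService comes from the legacy azure-storage-blob SDK (version 2.x).
        # Only the account name is passed, so the blob is read anonymously.
        blob_service = BlockBlobService(account_name=self.blob_account_name)
        byte_stream = BytesIO()
        # Download the blob's bytes into the in-memory stream.
        blob_service.get_blob_to_stream(
            container_name=self.blob_container_name,
            blob_name=self.blob_relative_path,
            stream=byte_stream)
        # Rewind the stream before handing it to the parquet reader.
        byte_stream.seek(0)

        return pq.read_table(source=byte_stream).to_pandas()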
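
The three attributes set in `__init__` together identify one blob. As a hedged illustration of how they combine, the sketch below composes the HTTPS address of a blob from them. It assumes the public Azure cloud endpoint suffix `blob.core.windows.net` and a container that allows anonymous reads; the function name `public_blob_url` is hypothetical, and the sketch does not check that the blob is still reachable.

```python
from urllib.parse import quote


def public_blob_url(account_name: str, container_name: str, relative_path: str) -> str:
    """Compose the HTTPS URL of a blob (assumes the public Azure cloud endpoint)."""
    # Percent-encode the path but keep "/" so virtual folders stay intact.
    path = quote(relative_path.lstrip("/"))
    return f"https://{account_name}.blob.core.windows.net/{container_name}/{path}"


print(public_blob_url("azureopendatastorage", "tutorials", "noaa_isd_weather/demo.parquet"))
```

Swapping in your own account, container, and path here mirrors the change you would make in the `DemoData` class.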

## Initialize a DemoData instance, load the pandas DataFrame, and check the schema

In [3]:
df = DemoData().to_pandas_dataframe()
df.dtypes

datetime               datetime64[ns]
lat                           float64
long                          float64
stations.city                  object
count                           int32
stations.dock_count             int32
dtype: object
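
The enrichment step later in this tutorial refers to the `lat`, `long`, and `datetime` columns by name, so it can be worth confirming they exist before going further. The sketch below is one possible check, not part of the original notebook; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the function works on any iterable of column names (for example `df.columns`).

```python
from typing import Iterable, List

# Columns that the enrichment cell in this tutorial refers to by name.
REQUIRED_COLUMNS = ("datetime", "lat", "long")


def missing_columns(columns: Iterable[str],
                    required: Iterable[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent, in the order required."""
    present = set(columns)
    return [name for name in required if name not in present]


# Column names copied from the dtypes output above.
demo_columns = ["datetime", "lat", "long", "stations.city", "count", "stations.dock_count"]
print(missing_columns(demo_columns))  # prints []
```

If you load your own parquet file with different column names, pass those names to `LatLongColumn` and `LocationTimeCustomerData` instead.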

## Display the first 5 rows of the demo DataFrame

In [4]:
df.head(5)

| | datetime | lat | long | stations.city | count | stations.dock_count |
|---|---|---|---|---|---|---|
| 0 | 2015-05-01 | 37.787152 | -122.388013 | San Francisco | 28 | 15 |
| 1 | 2015-05-02 | 37.787152 | -122.388013 | San Francisco | 5 | 15 |
| 2 | 2015-05-03 | 37.787152 | -122.388013 | San Francisco | 11 | 15 |
| 3 | 2015-05-04 | 37.787152 | -122.388013 | San Francisco | 24 | 15 |
| 4 | 2015-05-05 | 37.787152 | -122.388013 | San Francisco | 24 | 15 |


## Define a PandasDataLoadLimitToMonths class to load the last N months of a given date range

In [5]:
from azure.storage.blob import BlockBlobService
from azureml.contrib.opendatasets._utils.time_utils import month_range
from azureml.contrib.opendatasets.dataaccess.pandas_data_load_limit import PandasDataLoadLimitNone


class PandasDataLoadLimitToMonths(PandasDataLoadLimitNone):
    def __init__(
            self,
            start_date,
            end_date,
            n_months,
            path_pattern='/year=%d/month=%d/'):
        self.start_date = start_date
        self.end_date = end_date
        self.n_months = n_months
        self.path_pattern = path_pattern
        super(PandasDataLoadLimitToMonths, self).__init__()

    def get_target_blob_paths(
            self,
            blob_service: BlockBlobService,
            blob_container_name: str,
            blob_relative_path: str):
        # Build one partition path per month between start_date and end_date.
        self._match_paths = []
        for current_month in month_range(self.start_date, self.end_date):
            self._match_paths.append(self.path_pattern % (current_month.year, current_month.month))

        # Keep only the last n_months paths; the message shows the most recent one.
        if len(self._match_paths) > 1:
            print('We are taking the latest n months: %s' % (self._match_paths[-1]))
            self._match_paths = self._match_paths[-self.n_months:]

        print('Target paths: %s' % (self._match_paths))
        return super(PandasDataLoadLimitToMonths, self).get_target_blob_paths(
            blob_service=blob_service,
            blob_container_name=blob_container_name,
            blob_relative_path=blob_relative_path)
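
The path selection in `get_target_blob_paths` can be tried out without the Azure packages. The sketch below reproduces the idea with the standard library only. `iter_months` is a stand-in written for this example and is only assumed to behave like the package's internal `month_range` (one value per calendar month, both ends inclusive); `last_n_month_paths` is likewise a hypothetical name.

```python
from datetime import datetime
from typing import Iterator, List


def iter_months(start: datetime, end: datetime) -> Iterator[datetime]:
    """Yield the first day of every calendar month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield datetime(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def last_n_month_paths(start: datetime, end: datetime, n_months: int,
                       path_pattern: str = '/year=%d/month=%d/') -> List[str]:
    """Build one partition path per month and keep only the last n_months."""
    if n_months < 1:
        raise ValueError("n_months must be at least 1")
    paths = [path_pattern % (m.year, m.month) for m in iter_months(start, end)]
    return paths[-n_months:]


print(last_n_month_paths(datetime(2015, 4, 1), datetime(2015, 6, 30, 23, 59), 3))
```

The explicit check on `n_months` matters because a slice such as `paths[-0:]` returns the whole list rather than an empty one.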

In [6]:
help(PandasDataLoadLimitToMonths)

Help on class PandasDataLoadLimitToMonths in module __main__:

class PandasDataLoadLimitToMonths(azureml.contrib.opendatasets.dataaccess.pandas_data_load_limit.PandasDataLoadLimitNone)
 |  PandasDataLoadLimitNone controls how many parquets will be loaded (no limit).
 |  
 |  Method resolution order:
 |      PandasDataLoadLimitToMonths
 |      azureml.contrib.opendatasets.dataaccess.pandas_data_load_limit.PandasDataLoadLimitNone
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, start_date, end_date, n_months, path_pattern='/year=%d/month=%d/')
 |      Initialize _match_paths.
 |  
 |  get_target_blob_paths(self, blob_service:azure.storage.blob.blockblobservice.BlockBlobService, blob_container_name:str, blob_relative_path:str)
 |      Get target blob paths based on its own filters.
 |      
 |      :param blob_service: block blob service.
 |      :type blob_service: BlockBlobService
 |      :param blob_container_name: blob container name
 |      :type blob_co

## Define a NoaaIsdWeatherForMonths class that inherits from NoaaIsdWeather
Overriding get_pandas_limit() lets us trade off load time against the amount of data loaded: the subclass returns a PandasDataLoadLimitToMonths, so only the last n_months of the date range are read.

In [7]:
from azureml.contrib.opendatasets import NoaaIsdWeather
from datetime import datetime
from dateutil import parser
from typing import List, Optional

class NoaaIsdWeatherForMonths(NoaaIsdWeather):
    _default_start_date = parser.parse('2008-01-01')
    _default_end_date = datetime.today()

    def __init__(
            self,
            start_date: datetime = _default_start_date,
            end_date: datetime = _default_end_date,
            n_months: int = 6,
            cols: Optional[List[str]] = None,
            enable_telemetry: bool = False):
        self.n_months = n_months
        super(NoaaIsdWeatherForMonths, self).__init__(
            start_date=start_date, end_date=end_date, cols=cols, enable_telemetry=enable_telemetry)
        
    def get_pandas_limit(self):
        return PandasDataLoadLimitToMonths(self.start_date, self.end_date, self.n_months)
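
The class above relies on a common pattern: the base class asks an overridable method for a "limit" object and lets that object decide what to load. The sketch below shows the pattern in isolation. Every class and method name in it (`LoadAll`, `LoadLastN`, `Dataset`, `DatasetLastN`, `get_limit`, `select`, `load`) is a stand-in invented for this example and is not the API of azureml-contrib-opendatasets.

```python
from typing import List, Sequence


class LoadAll:
    """Default limit: keep every path."""

    def select(self, paths: Sequence[str]) -> List[str]:
        return list(paths)


class LoadLastN(LoadAll):
    """Limit that keeps only the last n paths."""

    def __init__(self, n: int):
        self.n = n

    def select(self, paths: Sequence[str]) -> List[str]:
        return list(paths)[-self.n:]


class Dataset:
    """Base class: load() delegates the choice of paths to get_limit()."""

    def __init__(self, paths: Sequence[str]):
        self.paths = list(paths)

    def get_limit(self) -> LoadAll:
        return LoadAll()

    def load(self) -> List[str]:
        return self.get_limit().select(self.paths)


class DatasetLastN(Dataset):
    """Subclass: overriding get_limit() changes what load() returns."""

    def __init__(self, paths: Sequence[str], n: int):
        super().__init__(paths)
        self.n = n

    def get_limit(self) -> LoadAll:
        return LoadLastN(self.n)


months = ["month=4", "month=5", "month=6"]
print(Dataset(months).load())
print(DatasetLastN(months, 2).load())
```

Because only the hook is overridden, the loading logic in the base class stays untouched, which is the same reason the tutorial subclasses NoaaIsdWeather instead of rewriting it.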

## Enrich the demo data
> * Initialize the NoaaIsdWeatherForMonths class at lines 9-13.
>> * Pass a start/end date range of 2015/4 to 2015/6.
>> * Pass n_months=3 to get the last 3 months of the data.
> * Get the enricher from it at line 14.
> * Call enrich_customer_data_with_agg on the enricher to enrich the demo data at lines 15-19.
>> * Pass location_match_granularity=5 to get the closest 5 weather stations.
>> * Pass time_round_granularity='day' to round the datetime to 'day'.
>> * Pass agg='avg' to use aggregator_avg to do aggregation at the last stage of enriching.
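
To make the effect of rounding to 'day' and averaging concrete, the following sketch groups timestamped readings by calendar day and takes the mean of each group. It is a conceptual illustration using only the standard library, not the enricher's implementation: the function name `average_by_day` and the sample readings are made up, and the sketch assumes that rounding to 'day' means dropping the time of day.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Tuple


def average_by_day(readings: Iterable[Tuple[datetime, float]]) -> Dict[datetime, float]:
    """Group readings by calendar day and return the mean value per day."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        # Assumption: "round to day" is modeled as truncating to midnight.
        day = timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
        buckets[day].append(value)
    return {day: mean(values) for day, values in sorted(buckets.items())}


# Made-up temperature readings for illustration only.
sample = [
    (datetime(2015, 5, 1, 3, 0), 12.0),
    (datetime(2015, 5, 1, 15, 0), 18.0),
    (datetime(2015, 5, 2, 9, 0), 14.0),
]
for day, value in average_by_day(sample).items():
    print(day.date(), value)
```

This is why the joined result has one weather value per column for each row of the demo data, even though a station reports many observations per day.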

In [8]:
# This is a contrib package in preview. The package name is subject to change.

from azureml.contrib.opendatasets.accessories.location_data import LatLongColumn
from azureml.contrib.opendatasets.accessories.location_time_customer_data import LocationTimeCustomerData
from datetime import datetime


_customer_data = LocationTimeCustomerData(df, LatLongColumn('lat', 'long'), 'datetime')
weather = NoaaIsdWeatherForMonths(
    cols=["temperature", "windSpeed", "seaLvlPressure"],
    start_date=datetime(2015, 4, 1, 0, 0),
    end_date=datetime(2015, 6, 30, 23, 59),
    n_months=3)
weather_enricher = weather.get_enricher()
joined_data = weather_enricher.enrich_customer_data_with_agg(
    customer_data_object=_customer_data,
    location_match_granularity=5,
    time_round_granularity='day',
    agg='avg')

We are taking the latest n months: /year=2015/month=6/
Target paths: ['/year=2015/month=4/', '/year=2015/month=5/', '/year=2015/month=6/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2015/month=4/part-00011-tid-2198075741767757560-e3eb994e-d560-4dfc-941e-0aae74c8d9ed-103.c000.snappy.parquet under container isdweatherdatacontainer
Reading ISDWeather/year=2015/month=5/part-00001-tid-2198075741767757560-e3eb994e-d560-4dfc-941e-0aae74c8d9ed-93.c000.snappy.parquet under container isdweatherdatacontainer
Reading ISDWeather/year=2015/month=6/part-00008-tid-2198075741767757560-e3eb994e-d560-4dfc-941e-0aae74c8d9ed-100.c000.snappy.parquet under container isdweatherdatacontainer
Done.
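
The `location_match_granularity=5` argument used above asks for the 5 closest weather stations to each row. The tutorial does not say how the package measures distance, so the sketch below is only one plausible way to pick the k nearest stations, using the haversine great-circle formula. The function names and the station list are invented for this example.

```python
from math import asin, cos, radians, sin, sqrt
from typing import List, Sequence, Tuple

Station = Tuple[str, float, float]  # (name, latitude, longitude)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def nearest_stations(lat: float, lon: float,
                     stations: Sequence[Station], k: int = 5) -> List[Station]:
    """Return the k stations closest to the point (lat, lon)."""
    return sorted(stations, key=lambda s: haversine_km(lat, lon, s[1], s[2]))[:k]


# Made-up stations; the coordinates are for illustration only.
stations = [
    ("station-a", 37.79, -122.39),
    ("station-b", 37.62, -122.37),
    ("station-c", 34.05, -118.24),
]
closest = nearest_stations(37.787152, -122.388013, stations, k=2)
print([name for name, _, _ in closest])
```

Averaging over several nearby stations, as the enrichment call does with agg='avg', smooths over gaps in any single station's record.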


## Display the first 10 rows of the joined result

In [9]:
joined_data.data.head(10)

| | datetime | lat | long | stations.city | count | stations.dock_count | temperature | seaLvlPressure | windSpeed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-01 | 37.787152 | -122.388013 | San Francisco | 28 | 15 | 17.109827 | 1010.263462 | 3.487931 |
| 1 | 2015-05-02 | 37.787152 | -122.388013 | San Francisco | 5 | 15 | 13.647619 | 1011.945192 | 3.8 |
| 2 | 2015-05-03 | 37.787152 | -122.388013 | San Francisco | 11 | 15 | 13.163684 | 1012.007692 | 3.731383 |
| 3 | 2015-05-04 | 37.787152 | -122.388013 | San Francisco | 24 | 15 | 12.110891 | 1014.373077 | 4.45 |
| 4 | 2015-05-05 | 37.787152 | -122.388013 | San Francisco | 24 | 15 | 12.473057 | 1014.917308 | 4.897927 |
| 5 | 2015-05-06 | 37.787152 | -122.388013 | San Francisco | 28 | 15 | 12.531579 | 1012.480769 | 5.418947 |
| 6 | 2015-05-07 | 37.787152 | -122.388013 | San Francisco | 20 | 15 | 12.23587 | 1008.820192 | 4.744022 |
| 7 | 2015-05-08 | 37.787152 | -122.388013 | San Francisco | 21 | 15 | 13.047312 | 1010.196154 | 3.018817 |
| 8 | 2015-05-09 | 37.787152 | -122.388013 | San Francisco | 9 | 15 | 12.54949 | 1017.961765 | 3.629231 |
| 9 | 2015-05-10 | 37.787152 | -122.388013 | San Francisco | 10 | 15 | 12.211275 | 1018.724038 | 4.752475 |


## Check the stats of the joined result

In [10]:
joined_data.data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1850 entries, 0 to 1849
Data columns (total 9 columns):
datetime               1850 non-null datetime64[ns]
lat                    1850 non-null float64
long                   1850 non-null float64
stations.city          1850 non-null object
count                  1850 non-null int32
stations.dock_count    1850 non-null int32
temperature            1850 non-null float64
seaLvlPressure         1850 non-null float64
windSpeed              1850 non-null float64
dtypes: datetime64[ns](1), float64(5), int32(2), object(1)
memory usage: 130.1+ KB
