Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Tutorial: Load demo data and enrich it with NOAA ISD Weather data

In this tutorial, you load the demo data (a parquet file in Azure Blob), check the data schema, and enrich it with NOAA ISD Weather data.

Prerequisites:
> You must install the following PyPI packages on the cluster:
> * pandas 0.23.0 or above
> * azureml-opendatasets

Learn how to:
> * Load the demo data from Azure Blob
> * Check the demo data schema
> * Initialize NoaaIsdWeather class to load weather data
> * Enrich the demo data with weather data
> * Display the joined result and stats

## Install the azureml-opendatasets package

In [4]:
!pip uninstall -y azureml-opendatasets
!pip install azureml-opendatasets

## Define a DemoData class to load demo parquet from Azure Blob

In [6]:
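# Note: BlockBlobService belongs to the legacy azure-storage-blob SDK (v2.x and earlier).
from io import BytesIO

import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService


class DemoData:
    """Loads the demo parquet file from Azure Blob into a pandas DataFrame."""

    def __init__(self):
        self.blob_account_name = "azureopendatastorage"
        self.blob_container_name = "tutorials"
        self.blob_relative_path = "noaa_isd_weather/demo.parquet"

    def to_pandas_dataframe(self):
        # No account key is passed, so the blob is read anonymously.
        blob_service = BlockBlobService(account_name=self.blob_account_name)
        byte_stream = BytesIO()
        # Download the blob contents into the in-memory stream.
        blob_service.get_blob_to_stream(
            container_name=self.blob_container_name,
            blob_name=self.blob_relative_path,
            stream=byte_stream)
        # Rewind so the parquet reader starts from the beginning of the stream.
        byte_stream.seek(0)

        return pq.read_table(source=byte_stream).to_pandas()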

## Initialize a DemoData instance, load the pandas DataFrame, and check the schema

In [8]:
df = DemoData().to_pandas_dataframe()
df.dtypes

## Display the top 5 rows of the demo data DataFrame

In [10]:
df.head(5)

## Initialize the NoaaIsdWeather class, get its enricher, and enrich the demo data

Because the weather dataset is large, by default only the last month is read when the requested date range spans multiple months. To load more, see `04-nyc-taxi-join-weather-in-pandas.ipynb.ipynb` under this folder.

The join logic:

The pandas version of the join uses cKDTree to speed up the nearest-location lookup. The steps are:

1. Gather the public weather dataset as an array of long/lat points and build a cKDTree from it.
2. Gather the customer dataset as an array of long/lat points and pass it to the cKDTree query function to find the closest weather point for each customer point.
3. Join the public weather dataset with the customer dataset using the query result, then assign a ranking group ID.
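
The nearest-point lookup described above can be sketched with `scipy.spatial.cKDTree` and pandas. This is a minimal illustration, not the enricher's actual implementation: the `stations` and `customers` frames, their column names, and the values are hypothetical, and the distance is plain Euclidean distance in degrees rather than a great-circle distance.

```python
import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical weather stations (stand-in for the public weather dataset).
stations = pd.DataFrame({
    "station": ["A", "B", "C"],
    "lat": [40.0, 41.0, 42.0],
    "long": [-74.0, -73.0, -72.0],
    "temperature": [15.0, 12.5, 10.0],
})

# Hypothetical customer rows (stand-in for the demo data).
customers = pd.DataFrame({
    "id": [1, 2, 3],
    "lat": [40.1, 41.9, 40.9],
    "long": [-73.9, -72.2, -73.1],
})

# Build the tree from the station long/lat points.
tree = cKDTree(stations[["long", "lat"]].to_numpy())

# For each customer point, look up the single nearest station (k=1).
distances, nearest_idx = tree.query(customers[["long", "lat"]].to_numpy(), k=1)

# Pick the matched station rows in customer order and join them positionally.
matched = stations.iloc[nearest_idx].reset_index(drop=True)
joined = pd.concat(
    [
        customers.reset_index(drop=True),
        matched[["station", "temperature"]],
        pd.Series(distances, name="distance_deg"),
    ],
    axis=1,
)

print(joined)
```

Building the tree once and querying all customer points in a single call avoids comparing every customer row against every station, which is where the speed-up comes from.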

In [12]:
# This is a package in preview.

from azureml.opendatasets.accessories.location_data import LatLongColumn
from azureml.opendatasets.accessories.location_time_customer_data import LocationTimeCustomerData
from azureml.opendatasets import NoaaIsdWeather
from datetime import datetime


_customer_data = LocationTimeCustomerData(df, LatLongColumn('lat', 'long'), 'datetime')
weather = NoaaIsdWeather(
    cols=["temperature", "windSpeed", "seaLvlPressure"],
    start_date=datetime(2015, 5, 1, 0, 0),
    end_date=datetime(2015, 5, 31, 23, 59))
weather_enricher = weather.get_enricher()
joined_data = weather_enricher.enrich_customer_data_with_agg(
    customer_data_object=_customer_data,
    location_match_granularity=5,
    time_round_granularity='day',
    agg='avg')
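
To give a feel for what `time_round_granularity='day'` and `agg='avg'` ask for, the sketch below reproduces the idea in plain pandas: timestamps are reduced to the day, weather readings are averaged per station and day, and the result is joined to the customer rows. It is an illustration under assumptions, not the library's code: the frames, column names, and values are hypothetical, and truncating with `floor` is an assumed interpretation of the rounding step.

```python
import pandas as pd

# Hypothetical hourly weather readings for one station.
weather_df = pd.DataFrame({
    "station": ["A", "A", "A", "A"],
    "datetime": pd.to_datetime([
        "2015-05-01 01:00", "2015-05-01 13:00",
        "2015-05-02 02:00", "2015-05-02 14:00",
    ]),
    "temperature": [10.0, 20.0, 12.0, 16.0],
})

# Hypothetical customer rows, already matched to station "A".
customer_df = pd.DataFrame({
    "id": [1, 2],
    "station": ["A", "A"],
    "datetime": pd.to_datetime(["2015-05-01 09:30", "2015-05-02 17:45"]),
})

# Reduce both timestamp columns to day granularity (assumed: truncation).
weather_df["day"] = weather_df["datetime"].dt.floor("D")
customer_df["day"] = customer_df["datetime"].dt.floor("D")

# Average the readings per station and day.
daily_avg = weather_df.groupby(["station", "day"], as_index=False)["temperature"].mean()

# Attach the daily average to each customer row.
joined_sketch = customer_df.merge(daily_avg, on=["station", "day"], how="left")

print(joined_sketch[["id", "day", "temperature"]])
```

A left merge keeps every customer row even when no weather reading exists for that station and day, in which case the weather columns are left empty.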

## Display the top 10 rows of the joined result

In [14]:
joined_data.data.head(10)

## Check the stats of the joined result

In [16]:
joined_data.data.info()