Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Tutorial: Load demo data and enrich it with NOAA ISD Weather data.

In this tutorial, you load the demo data (a parquet file in Azure Blob storage), check its schema, and enrich it with NOAA ISD Weather data.

Prerequisites:
> * pandas version must be 0.23.0 or above
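
The prerequisite above can be checked before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `meets_minimum` are hypothetical and not part of pandas or the opendatasets package.

```python
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '0.23.4' into a 3-part tuple of ints.

    Non-numeric suffixes (for example 'rc1') are ignored, and missing
    parts are padded with zeros.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="0.23.0"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("pandas")
    status = "OK" if meets_minimum(installed) else "older than 0.23.0"
    print("pandas", installed, status)
except metadata.PackageNotFoundError:
    print("pandas is not installed")
```

If the check reports an older version, upgrading pandas before continuing avoids failures later in the tutorial.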

Learn how to:
> * Load the demo data from Azure Blob
> * Check the demo data schema
> * Initialize the NoaaIsdWeather class to load weather data
> * Enrich the demo data with weather data
> * Display the joined result and stats


## Install azureml-contrib-opendatasets package

In [1]:
!pip uninstall -y azureml-contrib-opendatasets
!pip install azureml-contrib-opendatasets

Uninstalling azureml-contrib-opendatasets-1.0.30:
  Successfully uninstalled azureml-contrib-opendatasets-1.0.30
Collecting azureml-contrib-opendatasets
  Using cached https://files.pythonhosted.org/packages/64/51/4d3de57cf210941346d907584e0e6e56780067bc3555250b1fe62c2285f7/azureml_contrib_opendatasets-1.0.30-py3-none-any.whl
Installing collected packages: azureml-contrib-opendatasets
Successfully installed azureml-contrib-opendatasets-1.0.30


## Define a DemoData class to load demo parquet from Azure Blob

In [2]:
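# BlockBlobService belongs to the legacy azure-storage-blob SDK (versions earlier than 12).
from azure.storage.blob import BlockBlobService
import pyarrow.parquet as pq
from io import BytesIO

class DemoData:
    """Loads the demo parquet file from Azure Blob storage."""

    def __init__(self):
        self.blob_account_name = "azureopendatastorage"
        self.blob_container_name = "tutorials"
        self.blob_relative_path = 'noaa_isd_weather/demo.parquet'

    def to_pandas_dataframe(self):
        # No account key is passed, so the container is accessed anonymously.
        blob_service = BlockBlobService(account_name=self.blob_account_name)
        byte_stream = BytesIO()
        # Download the blob contents into the in-memory stream.
        blob_service.get_blob_to_stream(
            container_name=self.blob_container_name,
            blob_name=self.blob_relative_path,
            stream=byte_stream)
        # Rewind so the parquet reader starts from the beginning of the stream.
        byte_stream.seek(0)

        return pq.read_table(source=byte_stream).to_pandas()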
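
A detail worth keeping in mind when downloading into a `BytesIO`: writing leaves the stream position at the end of the data, so a reader that consumes from the current position sees nothing until the stream is rewound. The short sketch below illustrates this with placeholder bytes (not real parquet content); rewinding with `seek(0)` before handing the stream to a reader is a safe habit.

```python
from io import BytesIO

stream = BytesIO()
stream.write(b"placeholder bytes")  # stands in for the downloaded blob

# The position is now at the end, so a plain read returns nothing.
at_end = stream.read()

# Rewind, then read again to get the full contents.
stream.seek(0)
from_start = stream.read()

print(at_end)
print(from_start)
```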

## Initialize a DemoData instance, load the pandas DataFrame, and check the schema

In [3]:
df = DemoData().to_pandas_dataframe()
df.dtypes

datetime               datetime64[ns]
lat                           float64
long                          float64
stations.city                  object
count                           int32
stations.dock_count             int32
dtype: object

## Display the top 5 rows in the demo data dataframe

In [4]:
df.head(5)

| | datetime | lat | long | stations.city | count | stations.dock_count |
|---|---|---|---|---|---|---|
| 0 | 2015-05-01 | 37.787152 | -122.388013 | San Francisco | 28 | 15 |
| 1 | 2015-05-02 | 37.787152 | -122.388013 | San Francisco | 5 | 15 |
| 2 | 2015-05-03 | 37.787152 | -122.388013 | San Francisco | 11 | 15 |
| 3 | 2015-05-04 | 37.787152 | -122.388013 | San Francisco | 24 | 15 |
| 4 | 2015-05-05 | 37.787152 | -122.388013 | San Francisco | 24 | 15 |


## Initialize NoaaIsdWeather class, get the enricher from it and enrich demo data
Because the weather dataset is large, by default only the last month is read when the requested date range spans multiple months. To load more, see `03-nyc-taxi-join-weather.ipynb` in this folder.

In [5]:
# This is a contrib package in preview. The package name is subject to change.

from azureml.contrib.opendatasets.accessories.location_data import LatLongColumn
from azureml.contrib.opendatasets.accessories.location_time_customer_data import LocationTimeCustomerData
from azureml.contrib.opendatasets import NoaaIsdWeather
from datetime import datetime


_customer_data = LocationTimeCustomerData(df, LatLongColumn('lat', 'long'), 'datetime')
weather = NoaaIsdWeather(
    cols=["temperature", "windSpeed", "seaLvlPressure"],
    start_date=datetime(2015, 5, 1, 0, 0),
    end_date=datetime(2015, 5, 31, 23, 59))
weather_enricher = weather.get_enricher()
joined_data = weather_enricher.enrich_customer_data_with_agg(
    customer_data_object=_customer_data,
    location_match_granularity=5,
    time_round_granularity='day',
    agg='avg')

ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=19.98 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2015/month=5/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2015/month=5/part-00001-tid-2198075741767757560-e3eb994e-d560-4dfc-941e-0aae74c8d9ed-93.c000.snappy.parquet under container isdweatherdatacontainer
Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=53840.99 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=53858.98 [ms]
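
To build intuition for what `time_round_granularity='day'` combined with `agg='avg'` asks for, the idea can be sketched in plain Python: round each weather observation to its day, average the values per day, and attach the average to the customer row with the same date. This is only a conceptual sketch, not the package's implementation; it ignores location matching entirely, the function names `daily_average` and `enrich_with_daily_average` are hypothetical, and the sample values are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def daily_average(observations):
    """Average (timestamp, value) observations per calendar day."""
    buckets = defaultdict(list)
    for timestamp, value in observations:
        buckets[timestamp.date()].append(value)
    return {day: sum(values) / len(values) for day, values in buckets.items()}


def enrich_with_daily_average(rows, observations, column="temperature"):
    """Attach the daily average to each row whose date matches.

    Rows without a matching day get None in the new column.
    """
    averages = daily_average(observations)
    return [dict(row, **{column: averages.get(row["datetime"].date())})
            for row in rows]


# Made-up sample data, for illustration only.
rows = [
    {"datetime": datetime(2015, 5, 1), "count": 28},
    {"datetime": datetime(2015, 5, 2), "count": 5},
    {"datetime": datetime(2015, 5, 3), "count": 11},
]
observations = [
    (datetime(2015, 5, 1, 3, 0), 10.0),
    (datetime(2015, 5, 1, 15, 0), 14.0),
    (datetime(2015, 5, 2, 9, 0), 13.0),
]

enriched = enrich_with_daily_average(rows, observations)
for row in enriched:
    print(row["datetime"].date(), row["count"], row["temperature"])
```

The third row has no observation on its day, so its `temperature` is `None`; a check of this kind on the real joined result shows whether every customer row found matching weather data.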


## Display the top 10 rows of the joined result

In [6]:
joined_data.data.head(10)

| | datetime | lat | long | stations.city | count | stations.dock_count | windSpeed | seaLvlPressure | temperature |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-05-01 | 37.787152 | -122.388013 | San Francisco | 28 | 15 | 3.487931 | 1010.263462 | 17.109827 |
| 1 | 2015-05-02 | 37.787152 | -122.388013 | San Francisco | 5 | 15 | 3.8 | 1011.945192 | 13.647619 |
| 2 | 2015-05-03 | 37.787152 | -122.388013 | San Francisco | 11 | 15 | 3.731383 | 1012.007692 | 13.163684 |
| 3 | 2015-05-04 | 37.787152 | -122.388013 | San Francisco | 24 | 15 | 4.45 | 1014.373077 | 12.110891 |
| 4 | 2015-05-05 | 37.787152 | -122.388013 | San Francisco | 24 | 15 | 4.897927 | 1014.917308 | 12.473057 |
| 5 | 2015-05-06 | 37.787152 | -122.388013 | San Francisco | 28 | 15 | 5.418947 | 1012.480769 | 12.531579 |
| 6 | 2015-05-07 | 37.787152 | -122.388013 | San Francisco | 20 | 15 | 4.744022 | 1008.820192 | 12.23587 |
| 7 | 2015-05-08 | 37.787152 | -122.388013 | San Francisco | 21 | 15 | 3.018817 | 1010.196154 | 13.047312 |
| 8 | 2015-05-09 | 37.787152 | -122.388013 | San Francisco | 9 | 15 | 3.629231 | 1017.961765 | 12.54949 |
| 9 | 2015-05-10 | 37.787152 | -122.388013 | San Francisco | 10 | 15 | 4.752475 | 1018.724038 | 12.211275 |


## Check the stats of joined result

In [7]:
joined_data.data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1850 entries, 0 to 1849
Data columns (total 9 columns):
datetime               1850 non-null datetime64[ns]
lat                    1850 non-null float64
long                   1850 non-null float64
stations.city          1850 non-null object
count                  1850 non-null int32
stations.dock_count    1850 non-null int32
windSpeed              1850 non-null float64
seaLvlPressure         1850 non-null float64
temperature            1850 non-null float64
dtypes: datetime64[ns](1), float64(5), int32(2), object(1)
memory usage: 130.1+ KB
