# Load nyc_energy and enrich it with weather data

In this notebook, we enrich the NYC energy demand data with NOAA ISD weather data in a way that scales to the full time range.
We process the input one month at a time: each month's enriched data is written to the temp folder, and the accumulated result is saved to the current folder after every month completes.

* Load the CSV file, downloaded from: https://notebooks.azure.com/frlazzeri/projects/automatedml-ms-build/html/nyc_energy.csv
* Time range: 1/1/2012 to 8/12/2017
* Location: 'PORT AUTH DOWNTN MANHATTAN WALL ST' station at <font color='red'>lat: 40.701, long: -74.009</font>

In [1]:
# Install the package if it is not available.

# !pip uninstall -y azureml-contrib-opendatasets
# !pip install azureml-contrib-opendatasets

### Initialize global variables.

In [2]:
from datetime import datetime


start_date = datetime(2012, 1, 1, 0, 0)
end_date = datetime(2017, 8, 12, 23, 59)

start_date, end_date

(datetime.datetime(2012, 1, 1, 0, 0), datetime.datetime(2017, 8, 12, 23, 59))

In [3]:
from datetime import timedelta
from dateutil.relativedelta import relativedelta

import math


r = relativedelta(end_date, start_date)
months = r.years * 12 + r.months + math.floor((r.days + 30)/31)
months

68
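
The formula above rounds a partial trailing month up to a whole one, so the loop further down also covers August 2017. As a cross-check, here is a minimal sketch that enumerates the calendar-month windows directly. `month_windows` is a hypothetical helper, not part of the notebook, and it assumes the start date falls on the first day of a month, as it does here.

```python
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta


def month_windows(start, end):
    """Yield (window_start, window_end) pairs, one per calendar month up to `end`."""
    window_start = start
    while window_start <= end:
        # One millisecond before the next month begins.
        window_end = window_start + relativedelta(months=1) - timedelta(milliseconds=1)
        yield window_start, window_end
        window_start += relativedelta(months=1)


windows = list(month_windows(datetime(2012, 1, 1), datetime(2017, 8, 12, 23, 59)))
print(len(windows))     # 68
print(windows[-1][0])   # 2017-08-01 00:00:00
```

Note that the last window runs to the end of August 2017, past `end_date`; this is harmless as long as the input data simply has no rows after the end date.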

In [4]:
lat, long = 40.701, -74.009
lat, long

(40.701, -74.009)

### Load ``"./nyc_energy.csv"`` (download it and save it locally first) and preview the data.

In [5]:
from pandas import read_csv


# Drop the weather columns that ship with the file; they are replaced by the enrichment below.
df = read_csv('./nyc_energy.csv').drop(columns=['precip', 'temp'])
df['lat'] = lat
df['long'] = long
df.head(5)

Unnamed: 0,timeStamp,demand,lat,long
0,2012-01-01 00:00:00,4937.5,40.701,-74.009
1,2012-01-01 01:00:00,4752.1,40.701,-74.009
2,2012-01-01 02:00:00,4542.6,40.701,-74.009
3,2012-01-01 03:00:00,4357.7,40.701,-74.009
4,2012-01-01 04:00:00,4275.5,40.701,-74.009


### Parse the timeStamp column into a datetime column so that we can filter by date easily.

In [6]:
from dateutil import parser


df['datetime'] = df['timeStamp'].apply(parser.parse)
df.head(5)

Unnamed: 0,timeStamp,demand,lat,long,datetime
0,2012-01-01 00:00:00,4937.5,40.701,-74.009,2012-01-01 00:00:00
1,2012-01-01 01:00:00,4752.1,40.701,-74.009,2012-01-01 01:00:00
2,2012-01-01 02:00:00,4542.6,40.701,-74.009,2012-01-01 02:00:00
3,2012-01-01 03:00:00,4357.7,40.701,-74.009,2012-01-01 03:00:00
4,2012-01-01 04:00:00,4275.5,40.701,-74.009,2012-01-01 04:00:00
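
As an aside, pandas can do the same parsing in one vectorized call with `pd.to_datetime`, which is usually faster than applying `parser.parse` row by row. The sketch below uses a few toy rows (the timestamps and values are made up for illustration) and then filters one month, the same kind of filter the enrichment loop relies on.

```python
import pandas as pd

# Toy rows for illustration only; not taken from nyc_energy.csv.
sample = pd.DataFrame({
    "timeStamp": ["2012-01-01 00:00:00", "2012-01-31 23:00:00", "2012-02-01 00:00:00"],
    "demand": [100.0, 110.0, 120.0],
})

# Vectorized parsing of the whole column at once.
sample["datetime"] = pd.to_datetime(sample["timeStamp"])

# Keep only the rows that fall in January 2012.
january = sample[
    (sample["datetime"] >= pd.Timestamp("2012-01-01"))
    & (sample["datetime"] < pd.Timestamp("2012-02-01"))
]
print(len(january))  # 2
```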


### Create the temp folder in which we save the enriched data per month.

In [7]:
!if [ ! -d "./temp" ]; then mkdir temp; fi
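
The shell one-liner above depends on a POSIX shell being available to the notebook. A portable alternative, sketched here with the standard library only, creates the folder from Python and is safe to run repeatedly. `ensure_dir` is a hypothetical helper name.

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


temp_dir = ensure_dir("./temp")
print(temp_dir.is_dir())  # True
```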

In [8]:
import os.path
import pandas as pd
from azureml.contrib.opendatasets.accessories.location_data import LatLongColumn
from azureml.contrib.opendatasets.accessories.location_time_customer_data \
    import LocationTimeCustomerData
from azureml.contrib.opendatasets import NoaaIsdWeather


if os.path.exists('./nyc_energy_enriched.csv'):
    print('nyc_energy_enriched.csv exists already.')
else:
    print('[%s] Start enriching...' % datetime.now())
    all = pd.DataFrame([])
    i_date = start_date
    for m in range(months):
        j_date = i_date + relativedelta(months=1) - timedelta(milliseconds=1)

        # A monotonically increasing index is required for successful enrichment.
        df1 = df[(df['datetime'] >= i_date) & (df['datetime'] <= j_date)]
        df1['idx'] = list(range(len(df1.index)))
        df1 = df1.set_index('idx')

        energy = LocationTimeCustomerData(
            df1,
            LatLongColumn('lat', 'long'),
            'datetime')

        weather = NoaaIsdWeather(
            cols=["temperature", "precipTime", "precipDepth", "snowDepth"],
            start_date=i_date,
            end_date=j_date)

        weather_enricher = weather.get_enricher()
        joined_data = weather_enricher.enrich_customer_data_with_agg(
            customer_data_object=energy,
            location_match_granularity=5, # higher for high join success rate, lower for performance.
            time_round_granularity='day',
            agg='avg')

        fn = './temp/nyc_energy_enriched_%s.csv' % i_date
        joined_data.data.to_csv(fn)

        all = pd.concat([all, joined_data.data])
        all.to_csv('./nyc_energy_enriched.csv')

        i_date += relativedelta(months=1)

    print('[%s] End enriching...' % datetime.now())

[2019-04-26 06:43:47.660633] Start enriching...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=20.84 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2012/month=1/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2012/month=1/part-00004-tid-7816671341480880202-0b49e80b-f206-4731-ab5a-61d53f99b595-57.c000.snappy.parquet under container isdweatherdatacontainer
Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=126889.75 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=126911.75 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=3.39 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2012/month=2/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2012/month=2/part-00011-tid-7816671341480880202-

ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=43967.35 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=3.92 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2013/month=2/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2013/month=2/part-00011-tid-236689213593784421-264283c4-dffb-42b8-9bbf-d912ec6814af-77.c000.snappy.parquet under container isdweatherdatacontainer
Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=40129.77 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=40162.76 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=3.94 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2013/month=3/']
Looking for parquet files...
Reading them i

Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=40187.62 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=40220.08 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=3.2 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2014/month=3/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2014/month=3/part-00001-tid-9219175779481662582-3729dfdb-ab32-4767-b9b6-11d2d644c3ce-80.c000.snappy.parquet under container isdweatherdatacontainer
Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=50225.96 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=50243.52 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=3.88 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enric

Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=40982.18 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=41015.66 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=2.82 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2015/month=4/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2015/month=4/part-00011-tid-2198075741767757560-e3eb994e-d560-4dfc-941e-0aae74c8d9ed-103.c000.snappy.parquet under container isdweatherdatacontainer
Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=25953.82 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=25967.18 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=4.37 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enr

Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=43928.06 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=43948.75 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=4.6 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2016/month=5/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2016/month=5/part-00006-tid-6700213360605767691-4491b75c-f137-489b-b5df-4204b9326fda-111.c000.snappy.parquet under container isdweatherdatacontainer
Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=42744.48 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=42765.33 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=3.89 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enri

Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=47304.19 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=47322.27 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=16.77 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, enrich
Target paths: ['/year=2017/month=6/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading ISDWeather/year=2017/month=6/part-00010-tid-1321158002197267978-8e3eb092-4b7a-42de-97ee-e23297ed8955-128.c000.snappy.parquet under container isdweatherdatacontainer
Done.
ActivityCompleted: Activity=enrich, HowEnded=Success, Duration=46168.06 [ms]
ActivityCompleted: Activity=enrich_customer_data_with_agg, HowEnded=Success, Duration=46194.71 [ms]
ActivityStarted, get_enricher
ActivityCompleted: Activity=get_enricher, HowEnded=Success, Duration=2.94 [ms]
ActivityStarted, enrich_customer_data_with_agg
ActivityStarted, en
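
The `SettingWithCopyWarning` near the top of this output comes from assigning the `idx` column on `df1`, which is a filtered slice of `df`. The result is still correct here, but the warning can be avoided by building the fresh 0..n-1 index with `reset_index` instead of assigning a column. The following is a minimal sketch on toy data; whether the enricher treats this index exactly like the original `idx` index is an assumption that has not been verified here.

```python
import pandas as pd

# Toy frame standing in for the energy data.
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2012-01-01", "2012-01-02", "2012-02-01", "2012-02-02"]),
    "demand": [1.0, 2.0, 3.0, 4.0],
})

# Filter one month, then renumber the rows 0..n-1 in a new frame.
# No column is assigned on the slice, so pandas raises no warning.
mask = df["datetime"] >= pd.Timestamp("2012-02-01")
df1 = df[mask].reset_index(drop=True)
df1.index.name = "idx"

print(list(df1.index))  # [0, 1]
```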

### The final result has been saved to ``"./nyc_energy_enriched.csv"``

In [9]:
all.head(5)

Unnamed: 0,timeStamp,demand,lat,long,datetime,precipDepth,temperature,precipTime,snowDepth
0,2012-01-01 00:00:00,4937.5,40.701,-74.009,2012-01-01 00:00:00,0.773585,7.665934,2.896226,0.0
1,2012-01-01 01:00:00,4752.1,40.701,-74.009,2012-01-01 01:00:00,0.773585,7.665934,2.896226,0.0
2,2012-01-01 02:00:00,4542.6,40.701,-74.009,2012-01-01 02:00:00,0.773585,7.665934,2.896226,0.0
3,2012-01-01 03:00:00,4357.7,40.701,-74.009,2012-01-01 03:00:00,0.773585,7.665934,2.896226,0.0
4,2012-01-01 04:00:00,4275.5,40.701,-74.009,2012-01-01 04:00:00,0.773585,7.665934,2.896226,0.0


<font color='blue'>The join success rate is 100%</font>
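
One way to check a claim like this is to measure the share of rows whose weather columns are all populated. The sketch below does that on a toy frame and assumes that unmatched rows come back with nulls in the weather columns; that behaviour of the enricher is an assumption, and the values are made up.

```python
import pandas as pd

weather_cols = ["temperature", "precipTime", "precipDepth", "snowDepth"]

# Toy frame: the third row stands in for a row that found no weather match.
toy = pd.DataFrame({
    "demand": [1.0, 2.0, 3.0, 4.0],
    "temperature": [7.6, 7.6, None, 8.1],
    "precipTime": [2.9, 2.9, None, 1.0],
    "precipDepth": [0.7, 0.7, None, 0.0],
    "snowDepth": [0.0, 0.0, None, 0.0],
})

# A row counts as joined when every weather column has a value.
matched = toy[weather_cols].notnull().all(axis=1)
join_rate = matched.mean()
print(f"{join_rate:.0%}")  # 75%
```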

In [10]:
all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49205 entries, 0 to 270
Data columns (total 9 columns):
timeStamp      49205 non-null object
demand         49124 non-null float64
lat            49205 non-null float64
long           49205 non-null float64
datetime       49205 non-null datetime64[ns]
precipDepth    49205 non-null float64
temperature    49205 non-null float64
precipTime     49205 non-null float64
snowDepth      49205 non-null float64
dtypes: datetime64[ns](1), float64(7), object(1)
memory usage: 3.8+ MB
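
The summary reports 49205 entries but an index that only runs from 0 to 270. This is because each monthly chunk is indexed from 0 and `pd.concat` keeps those labels, so index values repeat across months. If a unique row index is wanted in the final file, `ignore_index=True` renumbers the rows, as this small sketch on toy frames shows.

```python
import pandas as pd

# Two toy monthly chunks, each indexed from 0 like the chunks in the loop.
jan = pd.DataFrame({"demand": [1.0, 2.0]})
feb = pd.DataFrame({"demand": [3.0]})

# Default concat keeps the per-chunk labels, so they repeat.
stacked = pd.concat([jan, feb])
print(list(stacked.index))     # [0, 1, 0]

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.
renumbered = pd.concat([jan, feb], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]
```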


In [11]:
# EOF