# 1. Tables preview and data organization

We load the raw data and build the tables that will be used to construct and train the model.

In [1]:
import pandasql as ps
import pandas as pd



In [2]:
buildings_metadata_full = pd.read_csv('./data/building_metadata.csv')
buildings_metadata_full = buildings_metadata_full.dropna()
buildings_metadata_full = buildings_metadata_full.reset_index(drop=True)
weather_train_full = pd.read_csv('./data/weather_train.csv')
weather_test_full = pd.read_csv('./data/weather_test.csv')
train_full = pd.read_csv('./data/train.csv')
test_full = pd.read_csv('./data/test.csv')
leak_df = pd.read_csv('./data/leak_df.csv')
leaked_test_target = pd.read_csv('./data/leaked_test_target.csv')

------

## 1.1 Tables Preview

### 1.1.1 Building Metadata.

We start by looking at the building metadata table, which appears to have the following attributes:

- site_id.
- building_id.
- primary_use.
- square_feet.
- year_built.
- floor_count.



In [3]:
buildings_metadata_full.head(2)

Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count
0,1,107,Education,97532,2005.0,10.0
1,1,108,Education,81580,1913.0,5.0


We then group it by `primary_use` and count how many buildings of each type we have in this dataset.

In [4]:
q = ps.sqldf("select primary_use, count(primary_use) from buildings_metadata_full group by primary_use")
q.head()

Unnamed: 0,primary_use,count(primary_use)
0,Education,145
1,Entertainment/public assembly,27
2,Healthcare,1
3,Lodging/residential,14
4,Manufacturing/industrial,3


From the table we can see that the dataset consists mainly of Education facilities. More data visualization could be done on this table, but it would not serve our purpose, as we will use the information of a single building to build a simple model.
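
The same per-category count can also be obtained directly in pandas, without going through SQL. The following is a minimal sketch on a toy dataframe; `toy_buildings` and its values are made up for illustration and are not taken from the real metadata table.

```python
import pandas as pd

# Toy stand-in for buildings_metadata_full (illustrative values only).
toy_buildings = pd.DataFrame({
    "building_id": [1, 2, 3, 4, 5],
    "primary_use": ["Education", "Education", "Office", "Healthcare", "Education"],
})

# Count the buildings per primary_use, most frequent category first.
use_counts = toy_buildings["primary_use"].value_counts()
print(use_counts)
```

Unlike the SQL query, `value_counts()` sorts by frequency rather than alphabetically, which makes the dominant category easier to spot.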

### 1.1.2 Weather data.

Next we look at the information contained in the weather train and test sets. The following attributes are observed:

- site_id.
- timestamp.
- air_temperature.
- cloud_coverage.
- dew_temperature.
- precip_depth_1_hr.
- sea_level_pressure.
- wind_direction.
- wind_speed.

Therefore, we're looking at time series with some hourly weather conditions.

In [5]:
weather_train_full.head(1)

Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.7,0.0,0.0


In [6]:
weather_train_full.tail(1)

Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
139772,15,2016-12-31 23:00:00,1.7,,-5.6,-1.0,1008.5,180.0,8.8


------

In [7]:
weather_test_full.head(1)

Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,2017-01-01 00:00:00,17.8,4.0,11.7,,1021.4,100.0,3.6


In [8]:
weather_test_full.tail(1)

Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
277242,15,2018-12-31 23:00:00,3.3,,2.2,20.0,1014.7,140.0,5.1


------

From the head and tail of both tables, we observe that the weather train set has hourly weather data for the year 2016, and the weather test set for the years 2017 and 2018. Each table holds one weather time series per site, with site_id values running from 0 to 15 (up to 16 different sites).
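
Rather than reading the time span and the number of sites off the head and tail, both can be computed explicitly. This is a small sketch on a toy table; `toy_weather` is a made-up stand-in with only the two columns needed here.

```python
import pandas as pd

# Toy stand-in for a weather table (illustrative rows only).
toy_weather = pd.DataFrame({
    "site_id": [0, 0, 1, 1, 15],
    "timestamp": [
        "2016-01-01 00:00:00", "2016-01-01 01:00:00",
        "2016-01-01 00:00:00", "2016-12-31 23:00:00",
        "2016-12-31 23:00:00",
    ],
})
toy_weather["timestamp"] = pd.to_datetime(toy_weather["timestamp"])

# Time span covered by the table and number of distinct sites.
first_ts = toy_weather["timestamp"].min()
last_ts = toy_weather["timestamp"].max()
n_sites = toy_weather["site_id"].nunique()
print(first_ts, last_ts, n_sites)
```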

Note that there are some missing values that will be cleaned up later.

### 1.1.3 Train set and Test set.

We now look at the information contained in the train and test sets. The following attributes are observed:

- building_id.
- meter.
- timestamp.
- meter_reading.

In [9]:
train_full.head(1)

Unnamed: 0,building_id,meter,timestamp,meter_reading
0,0,0,2016-01-01 00:00:00,0.0


In [10]:
test_full.head(1)

Unnamed: 0,row_id,building_id,meter,timestamp
0,0,0,0,2017-01-01 00:00:00


Note that the test set does not include the meter_reading. We therefore need the target values from another source in order to validate the trained model. For that we have the following two tables.

### 1.1.4 Leak.

This section is called 'Leak' because this data was not made public when the contest was announced; it was leaked afterwards and contains the actual meter readings measured during 2017 and 2018. However, this information is only available for some buildings and some kinds of meters. After reviewing the content of these two tables, we'll join the target values to the test set so that the targets sit in the same table.

We look at the information contained in the leaked table. The following attributes are observed:

- building_id.
- meter.
- timestamp.
- meter_reading.

In [11]:
leak_df.head(1)

Unnamed: 0,building_id,meter,meter_reading,timestamp
0,0,0,0.0,2016-01-01 00:00:00
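
Joining the leaked targets to the test set, as described above, could look roughly like the following. This is only a sketch on toy data: `toy_test` and `toy_leak` are made-up stand-ins, and the choice of a left join on `building_id`, `meter` and `timestamp` is an assumption about how the two tables line up.

```python
import pandas as pd

# Toy stand-ins for test_full and leak_df (illustrative values only).
toy_test = pd.DataFrame({
    "row_id": [0, 1, 2],
    "building_id": [133, 133, 134],
    "meter": [0, 0, 0],
    "timestamp": ["2017-01-01 00:00:00", "2017-01-01 01:00:00", "2017-01-01 00:00:00"],
})
toy_leak = pd.DataFrame({
    "building_id": [133, 133],
    "meter": [0, 0],
    "meter_reading": [23.3, 50.1],
    "timestamp": ["2017-01-01 00:00:00", "2017-01-01 01:00:00"],
})

# A left join keeps every test row; rows without a leaked target get NaN.
keys = ["building_id", "meter", "timestamp"]
labelled = toy_test.merge(toy_leak, on=keys, how="left")
n_unlabelled = int(labelled["meter_reading"].isna().sum())
print(labelled)
print("rows without a leaked target:", n_unlabelled)
```

Counting the rows that remain without a target makes it visible that the leak covers only part of the test set.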


In the next section we will delimit, filter and clean our dataset. We'll also define in a clearer manner the objectives of the present work.

------

## 1.2 Organizing and Selecting Data

Here we will develop a simple model that forecasts the electricity consumption of a single building. We'll begin by selecting one of the buildings for which we have the test target values. We'll then filter the rest of the data (mainly weather) by this building's site_id, and keep only the readings of meter 0, which represents the consumed electricity.

### 1.2.1 Data Delimiting and Filtering
As stated above, we need to select a building for which we have all the necessary data. We'll start by determining which buildings appear both in the buildings metadata table and in the leaked test target table. The building must also have readings for meter 0.

In [12]:
intersection = ps.sqldf("select y.site_id, x.building_id, y.primary_use from leak_df as x join buildings_metadata_full as y on x.building_id = y.building_id  where x.meter = '0' group by y.site_id, x.building_id, y.primary_use")
intersection

Unnamed: 0,site_id,building_id,primary_use
0,1,107,Education
1,1,108,Education
2,1,109,Education
3,1,110,Education
4,1,111,Education
...,...,...,...
113,4,650,Education
114,4,652,Education
115,4,653,Education
116,4,654,Education
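
For comparison, the same candidate list can be built with pandas operations alone. The sketch below uses toy tables (`toy_leak` and `toy_meta` are made up for illustration) and mirrors the query: keep meter 0, inner-join on `building_id`, then remove duplicate rows.

```python
import pandas as pd

# Toy stand-ins for leak_df and buildings_metadata_full (illustrative values only).
toy_leak = pd.DataFrame({
    "building_id": [107, 107, 108, 200],
    "meter": [0, 1, 0, 1],
})
toy_meta = pd.DataFrame({
    "site_id": [1, 1, 2],
    "building_id": [107, 108, 200],
    "primary_use": ["Education", "Education", "Office"],
})

# Keep meter 0 readings, join with the metadata, and keep one row per building.
candidates = (
    toy_leak[toy_leak["meter"] == 0]
    .merge(toy_meta, on="building_id", how="inner")
    [["site_id", "building_id", "primary_use"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(candidates)
```

Building 200 drops out because it only has readings for meter 1.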


We'll avoid selecting a building whose primary use is Education, because such buildings have periods of low activity during vacations. We'd rather choose a building whose primary use implies continuous occupation.

In [13]:
check1 = ps.sqldf("select primary_use from intersection group by primary_use")
check1

Unnamed: 0,primary_use
0,Education
1,Entertainment/public assembly
2,Lodging/residential
3,Office
4,Parking
5,Public services
6,Technology/science
7,Utility


We opt for the Lodging/residential primary use and arbitrarily select one of the available buildings. We'll learn about the quality of its data once we visualize it in the second section of this notebook.

In [14]:
check = ps.sqldf("select * from intersection where primary_use = 'Lodging/residential'")
check

Unnamed: 0,site_id,building_id,primary_use
0,1,128,Lodging/residential
1,1,129,Lodging/residential
2,1,130,Lodging/residential
3,1,131,Lodging/residential
4,1,132,Lodging/residential
5,1,133,Lodging/residential
6,1,134,Lodging/residential
7,1,135,Lodging/residential
8,1,136,Lodging/residential
9,4,614,Lodging/residential


Based on the result of the query above, we'll work with the building with building_id = 133 and site_id = 1. We will later extract its readings from the leaked test target table.

First, we display the building's full specs.

In [15]:
building_metadata = ps.sqldf("select * from buildings_metadata_full where building_id = '133'")
building_metadata

Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count
0,1,133,Lodging/residential,64723,1960.0,8.0


### 1.2.2 Data Filtering
We now proceed to extract only the data that we'll be needing from all the tables, starting with the leaked test table.

#### 1.2.2.1 Test Table

In [16]:
test = ps.sqldf("select * from leak_df where building_id = '133' and meter = '0'")
test = ps.sqldf("select * from test where (timestamp like '%2017%' or timestamp like '%2018%')")
test

Unnamed: 0,building_id,meter,meter_reading,timestamp
0,133,0,23.3,2017-01-01 00:00:00
1,133,0,50.1,2017-01-01 01:00:00
2,133,0,48.4,2017-01-01 02:00:00
3,133,0,49.9,2017-01-01 03:00:00
4,133,0,50.0,2017-01-01 04:00:00
...,...,...,...,...
17515,133,0,0.0,2018-12-31 19:00:00
17516,133,0,0.0,2018-12-31 20:00:00
17517,133,0,0.0,2018-12-31 21:00:00
17518,133,0,0.0,2018-12-31 22:00:00


#### 1.2.2.2 Train Table

Now we filter the train table for the building_id = 133.

In [17]:
train = ps.sqldf("select * from train_full where building_id = '133' and meter = '0'")
train

Unnamed: 0,building_id,meter,timestamp,meter_reading
0,133,0,2016-01-01 00:00:00,17.7
1,133,0,2016-01-01 01:00:00,37.1
2,133,0,2016-01-01 02:00:00,37.8
3,133,0,2016-01-01 03:00:00,35.1
4,133,0,2016-01-01 04:00:00,27.5
...,...,...,...,...
8779,133,0,2016-12-31 19:00:00,60.9
8780,133,0,2016-12-31 20:00:00,56.5
8781,133,0,2016-12-31 21:00:00,54.2
8782,133,0,2016-12-31 22:00:00,52.1


Note that from the shape of the table (8784 hourly rows) we can tell that 2016 was a leap year.
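
The arithmetic behind this observation is easy to verify. The helper below is a hypothetical name introduced only for this check; it counts the hours in a calendar year, taking leap years into account.

```python
import calendar

def hours_in_year(year):
    """Number of hours in a calendar year (24 per day, 366 days in a leap year)."""
    days = 366 if calendar.isleap(year) else 365
    return 24 * days

print(hours_in_year(2016))                        # 8784
print(hours_in_year(2017) + hours_in_year(2018))  # 17520
```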

#### 1.2.2.3 Weather Tables

We keep only the weather data for site_id = 1, for both the weather_train and weather_test sets. At this point we can drop the site_id column, as it would only add redundancy.

In [18]:
weather_train = ps.sqldf("select timestamp, air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr, sea_level_pressure, wind_direction, wind_speed from weather_train_full where site_id='1'")
weather_train

Unnamed: 0,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,2016-01-01 00:00:00,3.8,,2.4,,1020.9,240.0,3.1
1,2016-01-01 01:00:00,3.7,0.0,2.4,,1021.6,230.0,2.6
2,2016-01-01 02:00:00,2.6,0.0,1.9,,1021.9,0.0,0.0
3,2016-01-01 03:00:00,2.0,0.0,1.2,,1022.3,170.0,1.5
4,2016-01-01 04:00:00,2.3,0.0,1.8,,1022.7,110.0,1.5
...,...,...,...,...,...,...,...,...
8758,2016-12-31 19:00:00,8.1,,6.5,,1027.5,220.0,3.6
8759,2016-12-31 20:00:00,7.2,,6.1,,1026.9,220.0,4.1
8760,2016-12-31 21:00:00,6.9,,5.8,,1026.2,220.0,4.6
8761,2016-12-31 22:00:00,6.9,,6.2,,1025.4,190.0,3.1


From the shape of the table we can tell that there are some missing measurements (we should have 8784 as in the train table).
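
To see which hours are actually missing, rather than only how many, one option is to compare the observed timestamps against a complete hourly range. This is a sketch on a toy index; `toy_index` is made up and has a single gap at 02:00.

```python
import pandas as pd

# Toy hourly index with one missing hour (02:00).
toy_index = pd.DatetimeIndex([
    "2016-01-01 00:00", "2016-01-01 01:00",
    "2016-01-01 03:00", "2016-01-01 04:00",
])

# Full hourly range between the first and last observation.
expected = pd.date_range(toy_index.min(), toy_index.max(), freq=pd.offsets.Hour())

# Timestamps present in the full range but absent from the data.
missing = expected.difference(toy_index)
print(missing)
```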

In [19]:
weather_test = ps.sqldf("select timestamp, air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr, sea_level_pressure, wind_direction, wind_speed from weather_test_full where site_id='1'")
weather_test

Unnamed: 0,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,2017-01-01 00:00:00,6.7,,5.2,,1024.1,200.0,5.1
1,2017-01-01 01:00:00,6.2,,5.1,,1022.7,210.0,3.6
2,2017-01-01 02:00:00,6.0,,4.9,,1021.9,210.0,4.6
3,2017-01-01 03:00:00,5.7,,4.8,,1020.7,200.0,3.6
4,2017-01-01 04:00:00,5.6,,4.5,,1019.6,210.0,4.1
...,...,...,...,...,...,...,...,...
17282,2018-12-31 19:00:00,9.0,,5.3,,1035.3,270.0,4.1
17283,2018-12-31 20:00:00,8.9,,5.1,,1035.2,270.0,4.1
17284,2018-12-31 21:00:00,9.1,,5.1,,1035.3,290.0,3.6
17285,2018-12-31 22:00:00,9.0,,5.0,,1035.0,280.0,3.6


We have now finished selecting the data that we will use. The next step is to clean it; the cleaned tables are exported to `.csv` files at the end of this section.

### 1.2.3 Handling Missing Data
As mentioned earlier, the test and train tables appear to be complete (no missing rows). The same is not true of the weather data, so we'll need to fill in those gaps.
We'll then drop the attributes that carry almost no information and, finally, interpolate the remaining missing data.

#### 1.2.3.1 Completing Missing Rows

First we parse the timestamps and set them as the index, turning each dataframe into a time series. This makes it easier to fill in the missing rows.

In [20]:
train['timestamp'] = pd.to_datetime(train['timestamp'], infer_datetime_format=True)
train = train.set_index('timestamp')
test['timestamp'] = pd.to_datetime(test['timestamp'], infer_datetime_format=True)
test = test.set_index('timestamp')
weather_train['timestamp'] = pd.to_datetime(weather_train['timestamp'], infer_datetime_format=True)
weather_train = weather_train.set_index('timestamp')
weather_test['timestamp'] = pd.to_datetime(weather_test['timestamp'], infer_datetime_format=True)
weather_test = weather_test.set_index('timestamp')
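
Since the same two steps are repeated for every table, they could be wrapped in a small helper. The function name `index_by_timestamp` is hypothetical, and the final `sort_index()` is an extra safeguard that the cell above does not perform.

```python
import pandas as pd

def index_by_timestamp(df, column="timestamp"):
    """Return a copy of df with `column` parsed to datetime and used as a sorted index."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column])
    return out.set_index(column).sort_index()

# Toy table with rows deliberately out of order (illustrative values only).
toy = pd.DataFrame({
    "timestamp": ["2016-01-01 01:00:00", "2016-01-01 00:00:00"],
    "meter_reading": [37.1, 17.7],
})
toy_indexed = index_by_timestamp(toy)
print(toy_indexed)
```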

We check the actual length of the different dataframes.

In [21]:
print('Number of hours of the different dataframes...')
print('train: {}hrs\ntest: {}hrs\nweather_train: {}hrs\nweather_test: {}hrs\n'.format(train.shape[0], test.shape[0], weather_train.shape[0], weather_test.shape[0]))

Number of hours of the different dataframes...
train: 8784hrs
test: 17520hrs
weather_train: 8763hrs
weather_test: 17287hrs



The row counts of the train and test sets match the number of hours in one and two years respectively: `24 × 366 = 8784` (2016 is a leap year) and `24 × 365 × 2 = 17520`. The weather train and test sets fall short of these counts, so they must have missing rows. With the following code we set an hourly frequency on the tables, which automatically inserts the missing rows filled with NaN values.

In [22]:
train = train.asfreq('H') #'H' specifies hourly resolution. 
test = test.asfreq('H')
weather_train = weather_train.asfreq('H')
weather_test = weather_test.asfreq('H')
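
The effect of `asfreq` is easiest to see on a tiny example. In this sketch `toy` is a made-up table with the 02:00 row missing; the frequency is passed as `pd.offsets.Hour()`, which is assumed here to be equivalent to the `'H'` alias used above.

```python
import pandas as pd

# Toy hourly series with the 02:00 observation missing (illustrative values only).
toy = pd.DataFrame(
    {"air_temperature": [3.8, 3.7, 2.0]},
    index=pd.DatetimeIndex(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 03:00"],
        name="timestamp",
    ),
)

# Reindex to a regular hourly grid; the missing hour appears as a NaN row.
hourly = toy.asfreq(pd.offsets.Hour())
print(hourly)
```

Note that `asfreq` only inserts the missing rows; it does not fill them, which is why an interpolation step is still needed later.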

We now recheck the length of the dataframes.

In [23]:
print('Number of hours of the different dataframes...')
print('train: {}hrs\ntest: {}hrs\nweather_train: {}hrs\nweather_test: {}hrs\n'.format(train.shape[0], test.shape[0], weather_train.shape[0], weather_test.shape[0]))

Number of hours of the different dataframes...
train: 8784hrs
test: 17520hrs
weather_train: 8784hrs
weather_test: 17520hrs



#### 1.2.3.2 Analysing NaN Proportion Per Attribute And Dropping (Almost) Empty Columns

We now count the missing values in each attribute of each dataframe. The number of False values is the number of values that are not NaN, i.e. that are present in the table.

In [24]:
test.isnull().value_counts()

building_id  meter  meter_reading
False        False  False            17520
dtype: int64

In [25]:
train.isnull().value_counts()

building_id  meter  meter_reading
False        False  False            8784
dtype: int64

The train and test set are therefore complete and full. We'll later just drop the meter and building_id attributes.

In [26]:
for attribute in weather_train.columns:
    print(weather_train[attribute].isnull().value_counts())

False    8762
True       22
Name: air_temperature, dtype: int64
True     7083
False    1701
Name: cloud_coverage, dtype: int64
False    8762
True       22
Name: dew_temperature, dtype: int64
True    8784
Name: precip_depth_1_hr, dtype: int64
False    8711
True       73
Name: sea_level_pressure, dtype: int64
False    8760
True       24
Name: wind_direction, dtype: int64
False    8763
True       21
Name: wind_speed, dtype: int64


For the weather train set, `80.6%` of the `cloud_coverage` values are missing, so we'll drop this column. `precip_depth_1_hr` is `100%` missing; we'll drop it as well. The other attributes have `less than 1% missing values`. We'll keep those columns and interpolate the gaps with a 2nd-degree polynomial.

In [27]:
for attribute in weather_test.columns:
    print(weather_test[attribute].isnull().value_counts())

False    17265
True       255
Name: air_temperature, dtype: int64
True     13906
False     3614
Name: cloud_coverage, dtype: int64
False    17265
True       255
Name: dew_temperature, dtype: int64
True    17520
Name: precip_depth_1_hr, dtype: int64
False    17222
True       298
Name: sea_level_pressure, dtype: int64
False    17282
True       238
Name: wind_direction, dtype: int64
False    17287
True       233
Name: wind_speed, dtype: int64


For the weather test set, `79.4%` of the `cloud_coverage` values are missing, so we'll drop this column. `precip_depth_1_hr` is `100%` missing; we'll drop it as well. The other attributes have `less than 2% missing values`. We'll keep those columns and interpolate the gaps with a 2nd-degree polynomial.
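
The percentages quoted above can be computed in one step, and the drop decision can be made explicit with a threshold. This is a sketch on a toy table; `toy_weather` is made up, and the 50% cut-off is an arbitrary value chosen for illustration, not one used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy weather table with different amounts of missing data (illustrative values only).
toy_weather = pd.DataFrame({
    "air_temperature": [3.8, 3.7, np.nan, 2.0, 2.3],
    "cloud_coverage": [np.nan, 0.0, np.nan, np.nan, np.nan],
    "precip_depth_1_hr": [np.nan] * 5,
})

# Share of missing values per column (the mean of a boolean mask is a fraction).
missing_share = toy_weather.isnull().mean()

# Hypothetical rule: drop every column with more than half of its values missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy_weather.drop(columns=to_drop)

print(missing_share)
print("dropped:", to_drop)
```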

In [28]:
train = ps.sqldf("select timestamp, meter_reading as electricity from train")
test = ps.sqldf("select timestamp, meter_reading as electricity from test")
weather_train = ps.sqldf("select timestamp, air_temperature, dew_temperature, sea_level_pressure, wind_direction, wind_speed from weather_train")
weather_test = ps.sqldf("select timestamp, air_temperature, dew_temperature, sea_level_pressure, wind_direction, wind_speed from weather_test")

#### 1.2.3.3 Joining Weather and Target Tables and 2nd-Order Interpolation of Missing Data

We join the train target table with the train weather data, and do the same for the test data. We then parse the dates again and, finally, interpolate the missing values.

In [29]:
train_set = ps.sqldf("select x.timestamp, air_temperature, dew_temperature, sea_level_pressure, wind_direction, wind_speed, electricity from train as x join weather_train as y on x.timestamp = y.timestamp")
test_set = ps.sqldf("select x.timestamp, air_temperature, dew_temperature, sea_level_pressure, wind_direction, wind_speed, electricity from test as x join weather_test as y on x.timestamp = y.timestamp")

In [30]:
train_set['timestamp'] = pd.to_datetime(train_set['timestamp'], infer_datetime_format=True)
train_set = train_set.set_index('timestamp')
test_set['timestamp'] = pd.to_datetime(test_set['timestamp'], infer_datetime_format=True)
test_set = test_set.set_index('timestamp')

In [31]:
train_set = train_set.interpolate(method='polynomial', order=2)
test_set = test_set.interpolate(method='polynomial', order=2)

We check whether the data was correctly completed.

In [32]:
for attribute in test_set.columns:
    print(test_set[attribute].isnull().value_counts())
    print(train_set[attribute].isnull().value_counts())

False    17520
Name: air_temperature, dtype: int64
False    8784
Name: air_temperature, dtype: int64
False    17520
Name: dew_temperature, dtype: int64
False    8784
Name: dew_temperature, dtype: int64
False    17520
Name: sea_level_pressure, dtype: int64
False    8784
Name: sea_level_pressure, dtype: int64
False    17520
Name: wind_direction, dtype: int64
False    8784
Name: wind_direction, dtype: int64
False    17520
Name: wind_speed, dtype: int64
False    8784
Name: wind_speed, dtype: int64
False    17520
Name: electricity, dtype: int64
False    8784
Name: electricity, dtype: int64


We can see that the data has been correctly completed.
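
A more compact way to make the same check is to reduce the whole table to a single number. The helper `count_missing` is a hypothetical name, and both toy frames are made up for illustration.

```python
import numpy as np
import pandas as pd

def count_missing(df):
    """Total number of NaN cells in a dataframe."""
    return int(df.isnull().sum().sum())

# Toy frames: one complete, one with a single gap (illustrative values only).
complete = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
with_gap = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})

print(count_missing(complete))
print(count_missing(with_gap))
```

A result of zero for both `train_set` and `test_set` would confirm the conclusion above without having to read through the per-column output.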

Finally, we write the formatted data to `.csv`.

In [33]:
train_set.to_csv('./data/clean/train_set.csv')
test_set.to_csv('./data/clean/test_set.csv')
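
When these files are read back later, the timestamp index comes back as plain text unless it is parsed explicitly. The sketch below shows one way to restore it; it reads from an in-memory string (`csv_text`, made up here) instead of the real files so that it runs on its own.

```python
import io
import pandas as pd

# In-memory stand-in for a file such as train_set.csv (illustrative values only).
csv_text = (
    "timestamp,air_temperature,electricity\n"
    "2016-01-01 00:00:00,3.8,17.7\n"
    "2016-01-01 01:00:00,3.7,37.1\n"
)

# Parse the timestamp column and use it as the index again.
reloaded = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["timestamp"],
    index_col="timestamp",
)
print(reloaded)
```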

------