# ASHRAE – Great Energy Predictor III

### Introduction:
This notebook uses the dataset from the ASHRAE – Great Energy Predictor III competition (How much energy will a building consume?). The goal is to develop models from ASHRAE’s 2016 data in order to better understand metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters. The data comes from over 1,000 buildings over a one-year timeframe. The method chosen to solve the problem is an LSTM, a type of recurrent neural network (RNN).
    
### Objective

The objective of this notebook is to build a predictive model using an LSTM to answer the question: how much energy will a building consume?

The train dataset contains our target variable, `meter_reading`, with datatype float, so this is a regression task that can be approached with an RNN. The notebook follows the outline below.



### Outline

1. Data Understanding

2. Data Preparation

2.1 Merge tables

2.2 Dropping columns and filling null values for columns: 'air_temperature', 'wind_speed', 'precip_depth_1_hr', 'cloud_coverage'

2.3 Prepare train & test data for LSTM

3. Data Modeling







Let's dive in!

### 1.Data Understanding

#### 1.1 Train data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
import seaborn as sns



In [None]:
train=pd.read_csv('dl_competition/train.csv')
print(train.shape)
train.head()

In [None]:
train.dtypes

In [None]:
train.isna().sum()

- Convert the timestamp column to datetime

In [None]:
train['timestamp']=pd.to_datetime(train.timestamp)

In [None]:
train["month"]=train.timestamp.dt.month
train["day"]=train.timestamp.dt.day

In [None]:
train.describe()

In [None]:
plt.boxplot(train[train.meter==0].meter_reading)

In [None]:
train[train.meter==0].describe()

In [None]:
train[(train.meter==0) & (train.meter_reading>70000)]

In [None]:
train[(train.meter==0) & (train.building_id==993) & (train.month==10) & (train.day==17)]

In [None]:
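building_meta=pd.read_csv('dl_competition/building_metadata.csv')  # also loaded in section 1.2; needed here first
building_meta[building_meta.building_id==993]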

- The readings 31921 and 79769 seem to be outliers: building 993 has Education as its primary_use, and these readings fall at 7 pm and 1 pm, where the meter reading would be expected to be 0
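
The outliers above were spotted by eye. A minimal sketch of a more systematic check is an interquartile-range rule applied to one building's readings. The helper `flag_iqr_outliers` is hypothetical, the 1.5 × IQR threshold is a common convention rather than something this notebook prescribes, and the toy values (apart from the two suspicious readings quoted above) are made up for illustration.

```python
import pandas as pd

def flag_iqr_outliers(readings: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values above Q3 + k * IQR."""
    q1 = readings.quantile(0.25)
    q3 = readings.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return readings > upper

# Toy data: eight made-up "normal" readings plus the two suspicious values
toy = pd.DataFrame({
    "building_id": [993] * 10,
    "meter_reading": [410.0, 395.0, 402.0, 420.0, 398.0,
                      405.0, 415.0, 400.0, 31921.0, 79769.0],
})
toy["is_outlier"] = flag_iqr_outliers(toy["meter_reading"])
outliers = toy.loc[toy["is_outlier"], "meter_reading"].tolist()
print(outliers)
```

On the real data the same helper could be applied per building and per meter with `groupby(...).transform(...)`, since consumption levels differ widely between buildings.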

In [None]:
plt.boxplot(train[train.meter==1].meter_reading)

In [None]:
train[train.meter==1].describe()

In [None]:
train[(train.meter==1) & (train.meter_reading>880000)]

In [None]:
building_meta[building_meta.building_id==778]

In [None]:
plt.boxplot(train[train.meter==2].meter_reading)

In [None]:
plt.boxplot(train[train.meter==3].meter_reading)

In [None]:
train[train.building_id==0]['timestamp']

- We have energy measurements per building and per meter, taken every hour over one year (2016)

#### Annual average of energy consumption by building 

In [None]:
dff=train[['building_id','meter','meter_reading']].groupby(['building_id','meter']).mean()

In [None]:
dff.reset_index(inplace=True)
dff.head()

In [None]:
dff.shape

In [None]:
dff.meter.hist()

- 1413 out of 1449 buildings are using electricity
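
Rather than reading the count off the histogram, the number of distinct buildings per meter type can be computed directly. The sketch below runs on a small made-up frame (not the competition data); the `METER_NAMES` mapping follows the meter order used in this notebook (0 electricity, 1 chilled water, 2 steam, 3 hot water).

```python
import pandas as pd

METER_NAMES = {0: "electricity", 1: "chilled water", 2: "steam", 3: "hot water"}

# Toy data: four buildings, some with more than one meter
toy = pd.DataFrame({
    "building_id":   [0, 0, 1, 1, 1, 2, 2, 3],
    "meter":         [0, 0, 0, 1, 1, 0, 3, 2],
    "meter_reading": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Count distinct buildings for each meter type and label the index
buildings_per_meter = (
    toy.groupby("meter")["building_id"].nunique().rename(index=METER_NAMES)
)
print(buildings_per_meter)
```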

#### Energy annual consumption

In [None]:
tags=['electricity', 'chilled water', 'steam', 'hot water']
# sum only meter_reading so the timestamp column is not aggregated
d=train.groupby('meter')['meter_reading'].sum()
for tag, total in zip(tags, d):
    print(' Annual consumption of %-13s : %f' % (tag, total))

In [None]:
train[train.building_id==403]

In [None]:
train.building_id.nunique()

In [None]:
t=set([train[train.building_id==i]['timestamp'].shape[0] for i in range(1449)])

In [None]:
t

In [None]:
train[train.meter==0].groupby('building_id').size().max()

- The maximum number of measurements for one building and one type of energy over the year is 8784
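
The figure 8784 is exactly the number of hours in 2016, which was a leap year (366 days × 24 hours). A quick check with the standard library:

```python
from datetime import datetime

# Hours between the start of 2016 and the start of 2017
span = datetime(2017, 1, 1) - datetime(2016, 1, 1)
hours_in_2016 = int(span.total_seconds() // 3600)
print(hours_in_2016)  # 8784
```

A building/meter pair with 8784 rows therefore has a reading for every hour of the year, and any pair with fewer rows has gaps.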

#### Building_id0 average consumption per month of electricity

In [None]:
train[(train.building_id==0) & (train.meter==0)][['month','meter_reading']].groupby('month').mean()

#### Visualize Building_id0 average consumption per month of electricity

In [None]:
g1 = train[(train.building_id==0) & (train.meter==0)][['month','meter_reading']].groupby('month').mean()

In [None]:
plt.scatter( range(1,13),g1, alpha=0.8, edgecolors='none', s=30)


#### Visualize Building_id1448 average consumption per month of electricity

In [None]:
g1 = train[(train.building_id==1448) & (train.meter==0)][['month','meter_reading']].groupby('month').mean()
plt.scatter( range(1,13),g1, alpha=0.8, edgecolors='none', s=30)

#### Visualize Building_id700 average consumption per month of electricity

In [None]:
g1 = train[(train.building_id==700) & (train.meter==0)][['month','meter_reading']].groupby('month').mean()

In [None]:
plt.scatter( range(1,13),g1, alpha=0.8, edgecolors='none', s=30)


#### 1.2 Building_meta_data

In [None]:
building_meta=pd.read_csv('dl_competition/building_metadata.csv')
print(building_meta.shape)
building_meta.head()

In [None]:
building_meta.isna().sum()

In [None]:
building_meta.describe()

In [None]:
building_meta[building_meta.year_built==2017]

In [None]:
train[train.building_id==363]

- building_id 363 was already using electricity in 2016, a year before its recorded year_built of 2017!

The year_built value is inconsistent with the readings.
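
This kind of inconsistency can also be searched for across all buildings by comparing `year_built` with the year of each building's first reading. The sketch below uses two small made-up frames that only mimic the column names of the real tables; `first_reading_year` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-ins for building_meta and train (values are illustrative)
meta = pd.DataFrame({
    "building_id": [1, 2, 363],
    "year_built": [1990.0, None, 2017.0],
})
readings = pd.DataFrame({
    "building_id": [1, 2, 363],
    "timestamp": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-01"]),
})

# Year of the earliest reading per building, indexed by building_id
first_year = readings.groupby("building_id")["timestamp"].min().dt.year
meta["first_reading_year"] = meta["building_id"].map(first_year)

# A building cannot have readings before it was built; missing years are skipped
inconsistent = meta[meta["year_built"] > meta["first_reading_year"]]
print(inconsistent["building_id"].tolist())
```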

In [None]:
building_meta.primary_use.unique()

In [None]:
building_meta.isna().sum()

In [None]:
# remove floor_count (not recorded for half of the buildings) and year_built
#building_meta=building_meta[['site_id','building_id','primary_use','square_feet']]

#### 1.3 Weather data

In [None]:
weather=pd.read_csv('dl_competition/weather_train.csv')
print(weather.shape)
weather.head()

In [None]:
weather.describe()

In [None]:
weather.site_id.unique()

In [None]:
weather.isna().sum()

In [None]:
# remove precip_depth_1_hr: 75% of its values are 0 and 50289 are NaN

In [None]:
weather[ weather.air_temperature.isna() ][['site_id','timestamp']]

In [None]:
weather[ weather.cloud_coverage.isna() ][['site_id','timestamp']]

In [None]:
weather[ weather.precip_depth_1_hr.isna() ][['site_id','timestamp']]

In [None]:
weather[ weather.sea_level_pressure.isna() ][['site_id','timestamp']]

In [None]:
df=weather[ weather.sea_level_pressure.isna() ][['site_id','timestamp']].copy()  # copy so new columns can be added safely

In [None]:
df['timestamp']=pd.to_datetime(df.timestamp)
df['month']=df.timestamp.dt.month
df['day']=df.timestamp.dt.day

In [None]:
t=df.groupby(['site_id','month','day']).count()

In [None]:
t

In [None]:
t.timestamp.value_counts()

In [None]:
t=df[df.site_id == 5].groupby(['month','day']).count()

In [None]:
t.timestamp.value_counts()

In [None]:
# Although a lot of data is missing, a closer look shows that the gaps usually cover only some hours, so we can fill them with a nearby measurement
# We will use ffill, which propagates the last valid observation forward to the next valid one, to fill NaNs in these cases
# df was used to see on how many days per site no sea level pressure was measured; for site_id 5 this is 355 days (almost the whole year)

In [None]:
df[df.site_id==5]

In [None]:
# for site 5 the sea level pressure is unknown

In [None]:
building_meta[building_meta.site_id==5].primary_use.value_counts()

In [None]:
building_meta[building_meta.site_id==0].primary_use.value_counts()

In [None]:
building_meta[building_meta.site_id==1].primary_use.value_counts()

In [None]:
building_meta[building_meta.site_id==2].primary_use.value_counts()

In [None]:
corrmat=weather.select_dtypes('number').corr()  # numeric columns only; timestamp is still a string
fig,ax=plt.subplots(figsize=(12,10))
sns.heatmap(corrmat,annot=True,annot_kws={'size': 12})

In [None]:
# air_temperature and dew_temperature are correlated, so we can remove one of them
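
Reading correlated pairs off a heatmap works for a handful of columns. A small sketch of doing the same programmatically follows; the data is synthetic (generated so that the two temperature columns move together), and the 0.8 threshold is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic weather-like data: dew temperature tracks air temperature
rng = np.random.default_rng(0)
air = rng.normal(15.0, 8.0, size=500)
toy = pd.DataFrame({
    "air_temperature": air,
    "dew_temperature": air - 5.0 + rng.normal(0.0, 1.0, size=500),
    "wind_speed": rng.normal(3.0, 1.0, size=500),
})

corr = toy.corr().abs()
cols = list(corr.columns)

# Visit each unordered pair once and keep those above the threshold
high = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > 0.8
]
print(high)
```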

### 2.Data preparation

In [None]:
# remove 'year_built' and 'floor_count'
building_meta=building_meta[['site_id', 'building_id', 'primary_use', 'square_feet']]
# remove 'dew_temperature' and 'precip_depth_1_hr'; copy so later fills do not act on a view
weather=weather[['site_id', 'timestamp', 'air_temperature', 'cloud_coverage', 'sea_level_pressure', 'wind_direction', 'wind_speed']].copy()

In [None]:
# fill NaNs by forward filling (ffill) the last valid observation
fill_cols=['air_temperature', 'cloud_coverage', 'sea_level_pressure', 'wind_speed', 'wind_direction']
weather[fill_cols]=weather[fill_cols].ffill()
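
The forward fill above runs over the whole weather table, so a gap at the start of one site's rows can be filled with the last value of the previous site. A possible refinement, sketched here on made-up data, is to forward fill within each `site_id` group. This is an alternative to consider, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy weather table: two sites, three hourly rows each, with gaps
toy = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00"] * 2
    ),
    "air_temperature": [20.0, np.nan, 22.0, np.nan, 5.0, np.nan],
})
toy = toy.sort_values(["site_id", "timestamp"])

# Plain ffill: the first row of site 1 inherits 22.0 from site 0
plain = toy["air_temperature"].ffill()

# Group-wise ffill: values never cross a site boundary
per_site = toy.groupby("site_id")["air_temperature"].ffill()

print(plain.tolist())
print(per_site.tolist())
```

With the group-wise version a leading gap stays NaN, which makes it visible; it could then be handled separately, for example with a backward fill inside the same group.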

### 3.Merge Building_meta_data, weather & train

In [None]:
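# after the merge, don't forget to drop site_id since it is determined by building_id
weather_ts=weather.assign(timestamp=pd.to_datetime(weather.timestamp))  # match the dtype of train.timestamp
merged_data=train.merge(building_meta,on='building_id',how='left')
# join weather on site and hour; joining on site_id alone would pair every reading with every weather row of its site
merged_data=merged_data.merge(weather_ts,on=['site_id','timestamp'],how='left')
merged_data=merged_data.drop(columns='site_id')

print(merged_data.shape)
merged_data.head()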
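
Because weather is recorded per site and per hour, one way to attach it to the readings is a left join on both `site_id` and `timestamp`, which should leave the number of reading rows unchanged. The sketch below demonstrates this on tiny made-up frames (`readings`, `meta`, and `weather_toy` are illustrative stand-ins) and uses the `validate` argument of `DataFrame.merge` to guard against accidental row duplication.

```python
import pandas as pd

# Toy stand-ins for train, building_meta and weather
readings = pd.DataFrame({
    "building_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]
    ),
    "meter_reading": [10.0, 12.0, 7.0],
})
meta = pd.DataFrame({
    "building_id": [0, 1],
    "site_id": [0, 1],
    "square_feet": [7432, 2720],
})
weather_toy = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]
    ),
    "air_temperature": [25.0, 24.4, 3.8],
})

# Each reading matches at most one building row and one weather row
merged = readings.merge(meta, on="building_id", how="left", validate="many_to_one")
merged = merged.merge(
    weather_toy, on=["site_id", "timestamp"], how="left", validate="many_to_one"
)
print(merged.shape)
print(merged)
```

If either lookup table contained duplicate keys, `validate="many_to_one"` would raise a `MergeError` instead of silently multiplying rows.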

### 4.Modeling

In [None]:
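# use only the target column as a univariate series
dataset = merged_data.meter_reading.values #numpy.ndarray
dataset = dataset.astype('float32')
dataset = np.reshape(dataset, (-1, 1))
# log1p compresses the heavy right tail of meter_reading
dataset = np.log1p(dataset)
# split by row order: first 70% for training, the rest for testing
train_size = int(len(dataset) * 0.70)
test_size = len(dataset) - train_size
# named train_data/test_data so the train DataFrame is not overwritten
train_data, test_data = dataset[:train_size,:], dataset[train_size:,:]

def create_dataset(dataset, look_back=1):
    """Build (X, Y) pairs: X holds look_back consecutive values, Y the value that follows."""
    X, Y = [], []
    # the last valid window ends one step before the final value
    for i in range(len(dataset)-look_back):
        a = dataset[i:(i+look_back), 0]
        X.append(a)
        Y.append(dataset[i + look_back, 0])
    return np.array(X), np.array(Y)

look_back = 30
X_train, Y_train = create_dataset(train_data, look_back)
X_test, Y_test = create_dataset(test_data, look_back)

# reshape input to be [samples, time steps, features]; here 1 time step with look_back features
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))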
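
The windowing step above can be checked on a tiny series to confirm the resulting shapes. The sketch below uses a hypothetical helper `make_windows` and a made-up series of ten values. It also shows an alternative layout, `[samples, look_back, 1]`, in which the look-back values are treated as time steps of a single feature. That layout is a common choice for LSTM inputs but differs from the reshape used in this notebook.

```python
import numpy as np

def make_windows(series: np.ndarray, look_back: int):
    """Split a 1-D series into look_back-long inputs and next-step targets."""
    n_windows = len(series) - look_back
    X = np.stack([series[i:i + look_back] for i in range(n_windows)])
    y = series[look_back:]
    return X, y

# Toy series 0..9, log1p-transformed like meter_reading
series = np.log1p(np.arange(10, dtype="float32"))
X, y = make_windows(series, look_back=3)

# Alternative LSTM layout: 3 time steps, 1 feature per step
X_seq = X.reshape(X.shape[0], 3, 1)
print(X.shape, y.shape, X_seq.shape)

# expm1 inverts log1p, so targets can be mapped back to the original scale
first_target = np.expm1(y[0])
print(first_target)
```

Ten values with a look-back of 3 give seven windows, and the first target is the fourth value of the series.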