## Description

### Problem

#### Objective:
Energy savings are an important area of focus in today's world. Estimating them has two key elements:

* Forecasting future energy usage without improvements
* Forecasting energy use after a specific set of improvements have been implemented

Once a set of improvements has been implemented, their value can be hard to assess, because there is no way to truly know **how much energy a building would have used without the improvements**. The best we can do is build counterfactual models.

We build these counterfactual models from historic usage rates and observed weather, across four energy types:
* **chilled water**
* **electricity**
* **hot water**
* **steam**
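
The counterfactual idea can be sketched on synthetic data: fit a model on the period before the improvements, predict the period after them, and treat the gap between prediction and actual usage as the estimated savings. Everything below (the temperature-only model, the 20% retrofit effect, the variable names) is an assumption made for illustration, not part of the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical hourly data: consumption grows linearly with temperature.
temp_before = rng.uniform(0, 30, size=500)
usage_before = 50 + 2.0 * temp_before + rng.normal(0, 1, size=500)

# After a hypothetical retrofit the building uses 20% less energy.
temp_after = rng.uniform(0, 30, size=500)
usage_after = 0.8 * (50 + 2.0 * temp_after) + rng.normal(0, 1, size=500)

# Fit the counterfactual model on pre-retrofit data only.
model = LinearRegression().fit(temp_before.reshape(-1, 1), usage_before)

# Predict what the building would have used without the retrofit.
counterfactual = model.predict(temp_after.reshape(-1, 1))

# Savings = predicted usage without improvements - observed usage with them.
estimated_savings = counterfactual.sum() - usage_after.sum()
savings_fraction = estimated_savings / counterfactual.sum()
print(f"Estimated savings: {savings_fraction:.1%}")
```

The quality of the savings estimate depends entirely on how well the model predicts usage in the absence of improvements, which is why the forecasting error matters.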

### Data

#### Dataset:
The dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.

##### train.csv

* `building_id` - Foreign key for the building metadata.
* `meter` - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
* `timestamp` - When the measurement was taken
* `meter_reading` - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
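
As a small illustration of the `meter` coding above, the integer codes can be mapped to readable names with a dictionary. The toy frame and the `meter_name` column are hypothetical; only the code-to-name mapping comes from the description.

```python
import pandas as pd

# Mapping taken from the column description above.
METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

# Toy readings (made-up values) with the same columns as train.csv.
readings = pd.DataFrame({
    "building_id": [0, 1, 1],
    "meter": [0, 0, 3],
    "timestamp": ["2016-01-01 00:00:00"] * 3,
    "meter_reading": [0.0, 12.5, 3.2],
})

# Add a human-readable label next to the numeric code.
readings["meter_name"] = readings["meter"].map(METER_NAMES)

# Not every building has every meter type, so count meters per building.
meters_per_building = readings.groupby("building_id")["meter"].nunique()
print(readings[["building_id", "meter", "meter_name"]])
```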

##### building_metadata.csv

* `site_id` - Foreign key for the weather files.
* `building_id` - Foreign key for train.csv
* `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* `square_feet` - Gross floor area of the building
* `year_built` - Year building was opened
* `floor_count` - Number of floors of the building

##### weather_[train/test].csv

Weather data from a meteorological station as close as possible to the site.

* `site_id`
* `air_temperature` - Degrees Celsius
* `cloud_coverage` - Portion of the sky covered in clouds, in oktas
* `dew_temperature` - Degrees Celsius
* `precip_depth_1_hr` - Millimeters
* `sea_level_pressure` - Millibar/hectopascals
* `wind_direction` - Compass direction (0-360)
* `wind_speed` - Meters per second

##### test.csv

The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.

* `row_id` - Row id for your submission file
* `building_id` - Building id code
* `meter` - The meter id code
* `timestamp` - Timestamps for the test data period

## Loading Data

#### TO DO:

* load data tables


* join train and metadata tables
* join train and weather tables

In [2]:
import pandas as pd
import os

In [3]:
%%time
building_metadata = pd.read_csv('data' + os.sep + 'building_metadata.csv')

Wall time: 302 ms


In [4]:
%%time
train = pd.read_csv('data' + os.sep + 'train.csv')

Wall time: 14.1 s


In [5]:
%%time
test = pd.read_csv('data' + os.sep + 'test.csv')

Wall time: 1min 17s


In [None]:
# weather_train = pd.read_csv('data' + os.sep + 'weather_train.csv')

In [None]:
# weather_test = pd.read_csv('data' + os.sep + 'weather_test.csv')

In [None]:
# rename timestamps columns
# train.columns = ['building_id', 'meter', 'timestamp_meter', 'meter_reading']
# test.columns = ['row_id', 'building_id', 'meter', 'timestamp_meter']

# weather_train.columns = ['site_id', 'timestamp_weather', 'air_temperature', 'cloud_coverage', 'dew_temperature', \
#                         'precip_depth_1_hr', 'sea_level_pressure', 'wind_direction', 'wind_speed']
# weather_test.columns = ['site_id', 'timestamp_weather', 'air_temperature', 'cloud_coverage', 'dew_temperature', \
#                         'precip_depth_1_hr', 'sea_level_pressure', 'wind_direction', 'wind_speed']

In [6]:
test_data = building_metadata.copy()
train_data = building_metadata.copy()

In [7]:
test_data = test_data.join(test.set_index('building_id'), on='building_id', how='inner')
test_data.head()

Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count,row_id,meter,timestamp
0,0,0,Education,7432,2008.0,,0,0,2017-01-01 00:00:00
0,0,0,Education,7432,2008.0,,129,0,2017-01-01 01:00:00
0,0,0,Education,7432,2008.0,,258,0,2017-01-01 02:00:00
0,0,0,Education,7432,2008.0,,387,0,2017-01-01 03:00:00
0,0,0,Education,7432,2008.0,,516,0,2017-01-01 04:00:00


In [8]:
test_data.shape

(41697600, 9)

In [9]:
train_data = train_data.join(train.set_index('building_id'), on='building_id', how='inner')
train_data.head()

Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count,meter,timestamp,meter_reading
0,0,0,Education,7432,2008.0,,0,2016-01-01 00:00:00,0.0
0,0,0,Education,7432,2008.0,,0,2016-01-01 01:00:00,0.0
0,0,0,Education,7432,2008.0,,0,2016-01-01 02:00:00,0.0
0,0,0,Education,7432,2008.0,,0,2016-01-01 03:00:00,0.0
0,0,0,Education,7432,2008.0,,0,2016-01-01 04:00:00,0.0


In [10]:
train_data.shape

(20216100, 9)

In [None]:
# train_data = train_data.join(weather_train.set_index('site_id'), on='site_id', how='inner')

In [None]:
# test_data = test_data.join(weather_test.set_index('site_id'), on='site_id', how='inner')
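
A note on the weather join: joining on `site_id` alone would pair every meter reading with every weather row of that site. A sketch of a join on both the site and the timestamp is shown below on toy data. It assumes both tables carry a `timestamp` column with the same name (the rename cell above would change that), and all values are made up.

```python
import pandas as pd

# Toy stand-ins for the joined train table and the weather table.
readings = pd.DataFrame({
    "site_id": [0, 0, 1],
    "building_id": [0, 0, 5],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00",
                  "2016-01-01 00:00:00"],
    "meter_reading": [0.0, 1.5, 20.0],
})
weather = pd.DataFrame({
    "site_id": [0, 0, 1, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00",
                  "2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "air_temperature": [25.0, 24.4, 3.8, 3.7],
})

# Parse timestamps so both sides have the same dtype.
readings["timestamp"] = pd.to_datetime(readings["timestamp"])
weather["timestamp"] = pd.to_datetime(weather["timestamp"])

# A left join keeps every meter reading, even when a weather row is missing.
merged = readings.merge(weather, on=["site_id", "timestamp"], how="left")
print(merged.shape)
```

With a left join the row count of the readings table is preserved, which is a cheap sanity check after merging the real tables.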

## Preprocessing

### Data Cleaning

In [None]:
# import numpy as np

#### TO DO:

* Count NaNs
* Decide how to fill in the missing values


* Encode some columns (LabelEncoder)
* Split table into features and targets
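
A minimal sketch of the first three items on a toy metadata frame. The fill strategies (median for `year_built`, a `-1` sentinel for `floor_count`) are assumptions chosen for illustration, not decisions made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same columns as the building metadata (made-up values).
df = pd.DataFrame({
    "primary_use": ["Education", "Office", "Education", "Lodging"],
    "square_feet": [7432, 2720, 5376, 23685],
    "year_built": [2008.0, np.nan, 1991.0, np.nan],
    "floor_count": [np.nan, np.nan, np.nan, 2.0],
})

# Count missing values per column, as absolute numbers and as shares.
nan_counts = df.isna().sum()
nan_share = df.isna().mean()
print(nan_counts)

# Assumed fill strategies: median for year_built, sentinel for floor_count.
df["year_built"] = df["year_built"].fillna(df["year_built"].median())
df["floor_count"] = df["floor_count"].fillna(-1)

# Encode the categorical column as integers (classes are sorted alphabetically).
encoder = LabelEncoder()
df["primary_use"] = encoder.fit_transform(df["primary_use"])
```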

In [None]:
# import matplotlib.pyplot as plt
# %matplotlib inline
# import seaborn as sns

In [None]:
# from sklearn.preprocessing import LabelEncoder

### Statistics

#### TO DO:

* Correlation between features
* Correlation of features with the target values
* Draw histograms, bar charts, ...


* Drop unnecessary columns or combine some features
* Drop outliers (data.column.quantile)
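
The correlation and outlier items could look roughly like this. The data is synthetic, and the 0.99 quantile cutoff is an assumed threshold for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic features and target; the target depends on floor area only.
square_feet = rng.uniform(1_000, 100_000, size=n)
air_temperature = rng.normal(15, 8, size=n)
meter_reading = np.clip(0.002 * square_feet + rng.normal(0, 5, size=n), 0, None)
meter_reading[:5] = 1e6  # a few implausible spikes

df = pd.DataFrame({
    "square_feet": square_feet,
    "air_temperature": air_temperature,
    "meter_reading": meter_reading,
})

# Correlation of each feature with the target, before outlier removal.
corr_before = df.corr().loc["square_feet", "meter_reading"]

# Drop rows above the assumed 0.99 quantile of the target.
upper = df["meter_reading"].quantile(0.99)
trimmed = df[df["meter_reading"] <= upper]

# Full correlation matrix and the target correlation after trimming.
corr_matrix = trimmed.corr()
corr_after = corr_matrix.loc["square_feet", "meter_reading"]
print(f"before: {corr_before:.2f}, after: {corr_after:.2f}")
```

A handful of extreme readings can hide an otherwise strong linear relationship, so it is worth checking correlations both before and after trimming.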

## Metrics

#### TO DO:

* Check the data for imbalance!
* What metrics will we use and why?
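
One candidate, consistent with the `mean_squared_log_error` import used later in this notebook, is the mean squared logarithmic error (MSLE) or its square root (RMSLE). It compares `log(1 + y)` values, so it penalizes relative rather than absolute errors, which suits meter readings that span several orders of magnitude. The numbers below are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up readings spanning several orders of magnitude.
y_true = np.array([0.0, 10.0, 100.0, 1000.0])
y_pred = np.array([1.0, 12.0, 90.0, 1500.0])

# MSLE written out by hand: mean squared difference of log(1 + y).
msle_manual = np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)

# The same quantity from scikit-learn, and its square root.
msle_sklearn = mean_squared_log_error(y_true, y_pred)
rmsle = np.sqrt(msle_sklearn)
print(msle_manual, msle_sklearn, rmsle)
```

Note that `mean_squared_log_error` raises an error for negative inputs, so predictions may need clipping at zero before scoring.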

In [None]:
# from sklearn.metrics import ...

## ML models

### Preprocessing

#### TO DO:

* ~~split into train (80%) and test (val) (20%)~~
* ~~OneHotEncoding for categorical features~~
* ~~normalize (standardize) data~~


In [None]:
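import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_log_error

# Fixed seed so the split and the models are reproducible
random_state = 1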

#### TO DO:

* Get the column indices of categorical features and of numerical (non-binary) features
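
A sketch of how these indices might be derived from a feature frame. The choice of which columns count as categorical is an assumption, and `X_frame` is a hypothetical name; `categorical_indices` and `numerical_data_indices` match the names used in the pipelines below.

```python
import pandas as pd

# Toy feature frame (made-up values) with mixed column types.
X_frame = pd.DataFrame({
    "primary_use": ["Education", "Office"],
    "meter": [0, 1],
    "square_feet": [7432, 2720],
    "air_temperature": [25.0, 24.4],
})

# Assumption: these two columns are treated as categorical.
categorical_columns = ["primary_use", "meter"]
numerical_columns = [c for c in X_frame.columns if c not in categorical_columns]

# Positional indices, usable for slicing a NumPy array as X[:, indices].
categorical_indices = [X_frame.columns.get_loc(c) for c in categorical_columns]
numerical_data_indices = [X_frame.columns.get_loc(c) for c in numerical_columns]

# The pipelines below slice with X[:, indices], so convert to an array.
X = X_frame.to_numpy()
print(categorical_indices, numerical_data_indices)
```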

In [None]:
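# One-hot encode the categorical columns, standardize the numerical ones, then fit Ridge.
# categorical_indices and numerical_data_indices must be defined first (see TO DO above).
pipe_ridge = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        ('encoding', Pipeline(steps=[
            ('selecting', FunctionTransformer(lambda X: X[:, categorical_indices])),
            # Ignore categories that appear only in a validation fold
            ('encoding', OneHotEncoder(handle_unknown='ignore')),
        ])),
        ('scaling', Pipeline(steps=[
            ('selecting', FunctionTransformer(lambda X: X[:, numerical_data_indices])),
            ('scaling', StandardScaler()),
        ])),
    ])),
    ('model', Ridge(random_state=random_state)),
])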

In [None]:
# Same preprocessing as the Ridge pipeline, with Lasso as the final estimator.
pipe_lasso = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        ('encoding', Pipeline(steps=[
            ('selecting', FunctionTransformer(lambda X: X[:, categorical_indices])),
            # Ignore categories that appear only in a validation fold
            ('encoding', OneHotEncoder(handle_unknown='ignore')),
        ])),
        ('scaling', Pipeline(steps=[
            ('selecting', FunctionTransformer(lambda X: X[:, numerical_data_indices])),
            ('scaling', StandardScaler()),
        ])),
    ])),
    ('model', Lasso(random_state=random_state)),
])

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=random_state)

### Linear model

#### TO DO:

* Choose some linear models
* Find a good combination of hyperparameters via cross-validation
* plot dependency between score and some hyperparameter

In [None]:
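# Candidate regularization strengths, log-spaced from 1e-20 to 1e20
param_grid = {'model__alpha': np.logspace(-20, 20, 41)}
# scikit-learn maximizes scores, so MSLE is available as 'neg_mean_squared_log_error'.
# This scorer fails on negative predictions, which linear models can produce.
grid_search_ridge = GridSearchCV(pipe_ridge, param_grid, cv=3, n_jobs=2,
                                 scoring='neg_mean_squared_log_error')
grid_search_ridge.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % grid_search_ridge.best_score_)
print(grid_search_ridge.best_params_)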

In [None]:
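results_ridge = grid_search_ridge.cv_results_
plt.figure(figsize=[12., 9.])
plt.title("GridSearchCV for Ridge Regression")
plt.xlabel("Regularization coefficient")
plt.ylabel("MSLE")
plt.xscale("log")  # the alphas are log-spaced

# The grid key is 'model__alpha', so cv_results_ stores it as 'param_model__alpha'
x_axis = np.array(results_ridge['param_model__alpha'].data, dtype=float)
# scikit-learn error scorers return negated errors; flip the sign to plot MSLE
y_axis = -results_ridge['mean_test_score']
plt.plot(x_axis, y_axis)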

In [None]:
# Same grid and scorer as for Ridge ('msle' is not a valid scoring name)
grid_search_lasso = GridSearchCV(pipe_lasso, param_grid, cv=3, n_jobs=2,
                                 scoring='neg_mean_squared_log_error')
grid_search_lasso.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % grid_search_lasso.best_score_)
print(grid_search_lasso.best_params_)

In [None]:
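results_lasso = grid_search_lasso.cv_results_
plt.figure(figsize=[12., 9.])
plt.title("GridSearchCV for Lasso Regression")
plt.xlabel("Regularization coefficient")
plt.ylabel("MSLE")
plt.xscale("log")  # the alphas are log-spaced

# The grid key is 'model__alpha', so cv_results_ stores it as 'param_model__alpha'
x_axis = np.array(results_lasso['param_model__alpha'].data, dtype=float)
# scikit-learn error scorers return negated errors; flip the sign to plot MSLE
y_axis = -results_lasso['mean_test_score']
plt.plot(x_axis, y_axis)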

In [None]:
# GridSearchCV treats higher scores as better, so keep the search with the higher score
best_search = (
    grid_search_lasso
    if grid_search_lasso.best_score_ > grid_search_ridge.best_score_
    else grid_search_ridge
)

# best_estimator_ is the full pipeline, already refit on X_train with the best alpha
fin_model = best_search.best_estimator_
y_pred = fin_model.predict(X_val)
# Metric functions take (y_true, y_pred); evaluate on the held-out validation split
mean_squared_log_error(y_val, y_pred)

## Analysis of results

#### TO DO:

* Make submission files with our predictions for the rows of `test.csv`
* Submit them to Kaggle one at a time
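
A sketch of the first item, assuming the submission needs a `row_id` column and a `meter_reading` column (the second name is an assumption based on the target in `train.csv`). The predictions here are placeholders, not model output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test.csv (made-up rows).
test = pd.DataFrame({
    "row_id": [0, 1, 2],
    "building_id": [0, 1, 2],
    "meter": [0, 0, 0],
    "timestamp": ["2017-01-01 00:00:00"] * 3,
})

# Placeholder predictions, in the same row order as the test table.
y_test_pred = np.array([12.3, -0.4, 56.7])

# Energy readings cannot be negative, so clip predictions at zero.
submission = pd.DataFrame({
    "row_id": test["row_id"],
    "meter_reading": np.clip(y_test_pred, 0, None),
})

# Without a path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```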