## Description

### Problem

#### Objective:
Energy savings is an important area of focus in today's world. Estimating energy savings has two key elements:

* Forecasting future energy usage without improvements
* Forecasting energy use after a specific set of improvements have been implemented

Once we have implemented a set of improvements, assessing their value can be challenging, as there's no way to truly know **how much energy a building would have used without the improvements**. The best we can do is to build counterfactual models.

We build these counterfactual models from historic usage rates and observed weather, across four energy types:

* **chilled water**
* **electricity**
* **hot water**
* **steam**

### Data

#### Dataset:
The dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.

##### train.csv

* `building_id` - Foreign key for the building metadata.
* `meter` - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
* `timestamp` - When the measurement was taken
* `meter_reading` - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
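
As a minimal sketch of how the `meter` codes listed above can be decoded after loading, the snippet below reads a tiny in-memory sample laid out like `train.csv`. The sample rows, the timestamp format, and the `METER_NAMES` / `meter_name` names are illustrative assumptions, not part of the dataset.

```python
import io

import pandas as pd

# Mapping copied from the `meter` column description above.
METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

# Made-up rows in the same column layout as train.csv (values are illustrative only).
sample_csv = io.StringIO(
    "building_id,meter,timestamp,meter_reading\n"
    "0,0,2016-01-01 00:00:00,12.5\n"
    "0,1,2016-01-01 00:00:00,3.0\n"
    "1,0,2016-01-01 01:00:00,7.25\n"
)

# parse_dates turns the timestamp strings into datetime64 values.
sample = pd.read_csv(sample_csv, parse_dates=["timestamp"])

# Add a readable meter name next to the numeric code.
sample["meter_name"] = sample["meter"].map(METER_NAMES)
print(sample)
```

Keeping the numeric `meter` column alongside the decoded name leaves the original key available for later joins and grouping.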

##### building_metadata.csv

* `site_id` - Foreign key for the weather files.
* `building_id` - Foreign key for `train.csv`
* `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* `square_feet` - Gross floor area of the building
* `year_built` - Year building was opened
* `floor_count` - Number of floors of the building

##### weather_[train/test].csv

Weather data from a meteorological station as close as possible to the site.

* `site_id`
* `air_temperature` - Degrees Celsius
* `cloud_coverage` - Portion of the sky covered in clouds, in oktas
* `dew_temperature` - Degrees Celsius
* `precip_depth_1_hr` - Millimeters
* `sea_level_pressure` - Millibar/hectopascals
* `wind_direction` - Compass direction (0-360)
* `wind_speed` - Meters per second

##### test.csv

The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.

* `row_id` - Row id for your submission file
* `building_id` - Building id code
* `meter` - The meter id code
* `timestamp` - Timestamps for the test data period

## Loading Data

#### TO DO:

* load data tables


* join train and metadata tables
* join train and weather tables

In [1]:
import pandas as pd
import os

In [None]:
%%time
building_metadata = pd.read_csv(os.path.join('data', 'building_metadata.csv'))

In [None]:
%%time
train = pd.read_csv(os.path.join('data', 'train.csv'))

In [None]:
%%time
test = pd.read_csv(os.path.join('data', 'test.csv'))

In [None]:
# weather_train = pd.read_csv(os.path.join('data', 'weather_train.csv'))

In [None]:
# weather_test = pd.read_csv(os.path.join('data', 'weather_test.csv'))

In [None]:
# rename timestamps columns
# train.columns = ['building_id', 'meter', 'timestamp_meter', 'meter_reading']
# test.columns = ['row_id', 'building_id', 'meter', 'timestamp_meter']

# weather_train.columns = ['site_id', 'timestamp_weather', 'air_temperature', 'cloud_coverage', 'dew_temperature', \
#                         'precip_depth_1_hr', 'sea_level_pressure', 'wind_direction', 'wind_speed']
# weather_test.columns = ['site_id', 'timestamp_weather', 'air_temperature', 'cloud_coverage', 'dew_temperature', \
#                         'precip_depth_1_hr', 'sea_level_pressure', 'wind_direction', 'wind_speed']

In [None]:
test_data = building_metadata.copy()
train_data = building_metadata.copy()

In [None]:
# attach every test row to the metadata of its building (inner join on building_id)
test_data = test_data.join(test.set_index('building_id'), on='building_id', how='inner')
test_data.head()

In [None]:
test_data.shape

In [None]:
# attach every meter reading to the metadata of its building (inner join on building_id)
train_data = train_data.join(train.set_index('building_id'), on='building_id', how='inner')
train_data.head()

In [None]:
train_data.shape

In [None]:
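# match on site AND hour: joining on site_id alone would pair each reading with every weather row of the site
# (assumes both tables still use the original `timestamp` column name)
# train_data = train_data.merge(weather_train, on=['site_id', 'timestamp'], how='left')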

In [None]:
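# match on site AND hour: joining on site_id alone would pair each row with every weather row of the site
# (assumes both tables still use the original `timestamp` column name)
# test_data = test_data.merge(weather_test, on=['site_id', 'timestamp'], how='left')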
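
The weather join in the TO DO list can be sketched on toy data as follows. This is only an illustration under assumptions: the frames `readings` and `weather` and all their values are made up, and both are assumed to share a `timestamp` column with hourly values.

```python
import pandas as pd

# Toy frames; column names follow the data description, values are made up.
readings = pd.DataFrame({
    "site_id": [0, 0, 1],
    "building_id": [10, 10, 20],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 00:00"]
    ),
    "meter_reading": [5.0, 6.0, 7.0],
})
weather = pd.DataFrame({
    "site_id": [0, 0, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 01:00"]
    ),
    "air_temperature": [3.5, 4.0, 10.0],
})

# Left join on BOTH keys keeps every reading; hours without a weather row get NaN.
# validate="many_to_one" raises if a (site_id, timestamp) pair repeats in `weather`.
merged = readings.merge(
    weather, on=["site_id", "timestamp"], how="left", validate="many_to_one"
)
print(merged)
```

A left join is used here so that readings are not silently dropped when a weather hour is missing; whether an inner join is preferable depends on how missing weather is handled later.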

## Preprocessing

### Data Cleaning

In [None]:
# import numpy as np

#### TO DO:

* Count NaNs per column
* Decide how to fill in the missing values


* Encode categorical columns (LabelEncoder)
* Split table into features and targets

In [None]:
# import matplotlib.pyplot as plt
# %matplotlib inline
# import seaborn as sns

In [None]:
# from sklearn.preprocessing import LabelEncoder
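
The cleaning steps from the TO DO list above could look roughly like this on a toy table. Filling with the median is just one possible choice and is an assumption here, as are the sample values and the names `nan_counts`, `filled`, `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows using column names from the data description.
df = pd.DataFrame({
    "primary_use": ["Education", "Office", "Education", "Lodging"],
    "square_feet": [1000, 2500, 4000, 1800],
    "year_built": [1990.0, np.nan, 2005.0, np.nan],
    "floor_count": [np.nan, 3.0, np.nan, 5.0],
    "meter_reading": [10.0, 25.0, 38.0, 17.0],
})

# 1. Count missing values per column.
nan_counts = df.isna().sum()

# 2. Fill numeric gaps with the column median (an assumed strategy, not the only one).
filled = df.copy()
for col in ["year_built", "floor_count"]:
    filled[col] = filled[col].fillna(filled[col].median())

# 3. Replace category strings with integer codes.
encoder = LabelEncoder()
filled["primary_use"] = encoder.fit_transform(filled["primary_use"])

# 4. Split into features and target.
X = filled.drop(columns="meter_reading")
y = filled["meter_reading"]
print(nan_counts, X, y, sep="\n\n")
```

`LabelEncoder` assigns codes in sorted order of the labels, which implies an ordering that linear models may misread; one-hot encoding is listed later in the notebook for that reason.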

### Statistics

#### TO DO:

* correlation between the features
* correlation of features with target values
* draw histograms, barcharts, ...


* drop unnecessary columns or join some features
* drop data outliers (data.column.quantile)
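
A minimal sketch of the correlation check and the quantile-based outlier filter from the list above, run on synthetic data. The 0.98 cut-off, the synthetic relationship between `square_feet` and `meter_reading`, and the injected outliers are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: meter_reading grows with square_feet, temperature is unrelated.
data = pd.DataFrame({
    "square_feet": rng.uniform(1_000, 50_000, size=n),
    "air_temperature": rng.normal(15, 8, size=n),
})
data["meter_reading"] = 0.002 * data["square_feet"] + rng.normal(0, 5, size=n)
data.loc[:2, "meter_reading"] = 10_000.0  # three artificial outliers

# Correlation of each feature with the target, before filtering.
corr_before = data.corr()["meter_reading"].drop("meter_reading")

# Drop rows above the chosen quantile of the target (assumed threshold: 0.98).
upper = data["meter_reading"].quantile(0.98)
trimmed = data[data["meter_reading"] <= upper]

# Correlation of each feature with the target, after filtering.
corr_after = trimmed.corr()["meter_reading"].drop("meter_reading")
print(corr_before, corr_after, sep="\n\n")
```

`data.corr()` without the `drop` gives the full feature-to-feature matrix, which covers the first item of the list as well.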

## Metrics

#### TO DO:

* check the data for imbalance!
* What metrics will we use and why?

In [None]:
# from sklearn.metrics import ...
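
One candidate metric for a non-negative, heavily skewed target such as `meter_reading` is the root mean squared logarithmic error (RMSLE). The sketch below is a plain NumPy version; whether the competition scores submissions with this metric is not stated above, so treat that as an assumption to verify. The function name `rmsle` is hypothetical.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), so zero readings are handled without special cases.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Perfect predictions give an error of zero.
print(rmsle([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Because the error is computed on a log scale, it penalizes relative rather than absolute deviations, so large buildings do not dominate the score.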

## ML models

### Preprocessing

#### TO DO:

* split into train (80%) and test (val) (20%)
* OneHotEncoding for categorical features
* normalize (standardize) data

In [None]:
# from sklearn.model_selection import train_test_split

In [None]:
# from sklearn.preprocessing import OneHotEncoder
# from sklearn.preprocessing import StandardScaler
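
The three preprocessing steps listed above can be combined as in the following sketch. The toy feature table and the variable names are assumptions; `ColumnTransformer` is one way to apply different transforms to categorical and numeric columns, not necessarily the approach intended for the final notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up feature table with one categorical and two numeric columns.
X = pd.DataFrame({
    "primary_use": ["Education", "Office", "Lodging", "Education", "Office"] * 4,
    "square_feet": [1000, 2500, 1800, 4000, 3200] * 4,
    "floor_count": [1, 3, 2, 5, 4] * 4,
})
y = pd.Series(range(20), dtype=float)

# 80% train / 20% validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocess = ColumnTransformer([
    # Categories unseen during fit are encoded as all zeros instead of raising.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["primary_use"]),
    ("num", StandardScaler(), ["square_feet", "floor_count"]),
])

# Fit on the training part only, then reuse the fitted transform on validation data.
X_train_t = preprocess.fit_transform(X_train)
X_val_t = preprocess.transform(X_val)
print(X_train_t.shape, X_val_t.shape)
```

Fitting the encoder and scaler on the training split alone avoids leaking validation statistics into the model. For hourly readings, a split by time period may be a more realistic alternative to a random split.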

### Linear model

#### TO DO:

* Choose some linear models
* Find a good combination of hyperparameters via cross-validation
* Plot the score as a function of a hyperparameter

In [None]:
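# meter_reading is continuous, so this is a regression task (LogisticRegression is a classifier)
# from sklearn.linear_model import LinearRegression, Ridge, Lasso
# from sklearn.model_selection import KFold  # StratifiedKFold requires class labels
# from sklearn.model_selection import GridSearchCV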
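
A minimal sketch of the hyperparameter search described above, assuming a `Ridge` regressor and a small grid of `alpha` values. The synthetic data, the grid, and the RMSE scoring are illustrative assumptions rather than choices made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

# 5-fold cross-validated grid search over the regularization strength.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

# Mean validation RMSE per alpha; this table is what a score-vs-alpha plot would show.
scores = pd.DataFrame({
    "alpha": [params["alpha"] for params in search.cv_results_["params"]],
    "rmse": -search.cv_results_["mean_test_score"],
})
print(search.best_params_)
print(scores)
```

The `scores` table can be passed directly to a plotting call (for example with a logarithmic x-axis) to cover the last item of the TO DO list.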

### Non-linear model

#### TO DO:

* Choose some model (KNN or RandomForest)
* Find a good combination of hyperparameters via cross-validation
* Plot the score as a function of a hyperparameter

In [None]:
# regressors rather than classifiers, since meter_reading is continuous
# from sklearn.neighbors import KNeighborsRegressor
# from sklearn.ensemble import RandomForestRegressor
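
To look at how the score depends on a single hyperparameter, scikit-learn's `validation_curve` can be used. The sketch below assumes a `KNeighborsRegressor` on synthetic one-dimensional data; the parameter range and the data are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsRegressor

# Synthetic non-linear target: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Cross-validated scores for each candidate number of neighbors.
param_range = [1, 3, 5, 10, 25, 50]
train_scores, val_scores = validation_curve(
    KNeighborsRegressor(),
    X,
    y,
    param_name="n_neighbors",
    param_range=param_range,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Scores are negated RMSE, so flip the sign and average over the folds.
mean_val_rmse = -val_scores.mean(axis=1)
best_k = param_range[int(np.argmin(mean_val_rmse))]
print(dict(zip(param_range, mean_val_rmse.round(3))), best_k)
```

Plotting `mean_val_rmse` against `param_range` gives the dependency curve; the same call works for `RandomForestRegressor` with, for example, `param_name="n_estimators"`.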

## Analysis of results

#### TO DO:

* Make submission `.csv` files with our predictions for the rows of `test.csv`
* Submit them to Kaggle one at a time
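
Building a submission file might look like the sketch below. It assumes the submission needs a `row_id` column plus a prediction column named `meter_reading`, and that negative predictions should be clipped to zero; both are assumptions to check against the competition rules. The toy `test` frame and `predictions` array are made up.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for test.csv (values are made up).
test = pd.DataFrame({
    "row_id": [0, 1, 2],
    "building_id": [0, 0, 1],
    "meter": [0, 1, 0],
})

# Hypothetical model output, one prediction per test row, in the same order.
predictions = np.array([12.3, -0.5, 7.0])

submission = pd.DataFrame({
    "row_id": test["row_id"],
    # Energy use cannot be negative, so clip predictions at zero (assumed rule).
    "meter_reading": np.clip(predictions, 0, None),
})

# Written to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Using `index=False` keeps the pandas index out of the file, so the header contains only the two expected columns.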

### Comparison of models

#### TO DO:

* Compare results of the linear and non-linear models (take test scores from Kaggle)
* Write conclusion