## Description

### Problem

#### Objective:
Energy savings is one of the important area of focus our current world. Energy savings has two key elements:

* Forecasting future energy usage without improvements
* Forecasting energy use after a specific set of improvements have been implemented

Once we have implemented a set of improvements, the value of energy efficiency improvements can be challenging as there's no way to truly know **how much energy a building would have used without the improvements**. The best we can do is to build counterfactual models. 

We build these counterfactual models across four energy types:
* **chilled water**
* **electricity**
* **hot water**
* **steam**

based on historic usage rates and observed weather.

### Data

#### Dataset:
The dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.

##### train.csv

* `building_id` - Foreign key for the building metadata.
* `meter` - The meter id code. Read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
* `timestamp` - When the measurement was taken
* `meter_reading` - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.

##### building_meta.csv

* `site_id` - Foreign key for the weather files.
* `building_id` - Foreign key for training.csv
* `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* `square_feet` - Gross floor area of the building
* `year_built` - Year building was opened
* `floor_count` - Number of floors of the building

##### weather_[train/test].csv

Weather data from a meteorological station as close as possible to the site.

* `site_id`
* `air_temperature` - Degrees Celsius
* `cloud_coverage` - Portion of the sky covered in clouds, in oktas
* `dew_temperature` - Degrees Celsius
* `precip_depth_1_hr` - Millimeters
* `sea_level_pressure` - Millibar/hectopascals
* `wind_direction` - Compass direction (0-360)
* `wind_speed` - Meters per second

##### test.csv

The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.

* `row_id` - Row id for your submission file
* `building_id` - Building id code
* `meter` - The meter id code
* `timestamp` - Timestamps for the test data period

## Loading Data

#### TO DO:

* load data tables


* join train and metadata tables
* join train and weather tables

In [1]:
import pandas as pd
import os

In [2]:
%%time
building_metadata = pd.read_csv('data' + os.sep + 'building_metadata.csv')

Wall time: 66.9 ms


In [3]:
%%time
train_data = pd.read_csv('data' + os.sep + 'train.csv')

Wall time: 31.6 s


In [None]:
# %%time
# test = pd.read_csv('data' + os.sep + 'test.csv')

In [None]:
# weather_train = pd.read_csv('data' + os.sep + 'weather_train.csv')

In [None]:
# weather_test = pd.read_csv('data' + os.sep + 'weather_test.csv')

In [None]:
# rename timestamps columns
# train.columns = ['building_id', 'meter', 'ti   mestamp_meter', 'meter_reading']
# test.columns = ['row_id', 'building_id', 'meter', 'timestamp_meter']

# weather_train.columns = ['site_id', 'timestamp_weather', 'air_temperature', 'cloud_coverage', 'dew_temperature', \
#                         'precip_depth_1_hr', 'sea_level_pressure', 'wind_direction', 'wind_speed']
# weather_test.columns = ['site_id', 'timestamp_weather', 'air_temperature', 'cloud_coverage', 'dew_temperature', \
#                         'precip_depth_1_hr', 'sea_level_pressure', 'wind_direction', 'wind_speed']

In [None]:
# test_data = building_metadata.copy()
# test_data = test_data.join(test.set_index('building_id'), on='building_id', how='inner')
# test_data.head()

In [None]:
# test_data.shape

In [4]:
train_data = train_data.join(building_metadata.set_index('building_id'), on='building_id', how='inner')
train_data.head()

Unnamed: 0,building_id,meter,timestamp,meter_reading,site_id,primary_use,square_feet,year_built,floor_count
0,0,0,2016-01-01 00:00:00,0.0,0,Education,7432,2008.0,
2301,0,0,2016-01-01 01:00:00,0.0,0,Education,7432,2008.0,
4594,0,0,2016-01-01 02:00:00,0.0,0,Education,7432,2008.0,
6893,0,0,2016-01-01 03:00:00,0.0,0,Education,7432,2008.0,
9189,0,0,2016-01-01 04:00:00,0.0,0,Education,7432,2008.0,


In [5]:
train_data.shape

(20216100, 9)

Create a copy of train data for EDA

In [None]:
train_copy = train.copy()

In [6]:
train_data_electricity = train_data[train_data['meter'] == 0]
train_data_chilledWater = train_data[train_data['meter'] == 1]
train_data_steam = train_data[train_data['meter'] == 2]
train_data_hotWater = train_data[train_data['meter'] == 3]

In [7]:
train_data_electricity.shape

(12060910, 9)

In [8]:
train_data_chilledWater.shape

(4182440, 9)

In [9]:
train_data_steam.shape

(2708713, 9)

In [10]:
train_data_hotWater.shape

(1264037, 9)

## Preprocessing

### Data Cleaning

In [11]:
import numpy as np

#### TO DO:

* NaNs counting
* How to fill the blanks?


* Some columns processing (LabelEncoder/...)
* Preprocessing for each building
* Split table into features and targets

In [12]:
def ConvertDate(train_data):
    # Convert date to datetime format
    train_data['timestamp'] = pd.to_datetime(train_data['timestamp'])
    
    # Extract and store year, month, day, hour
    train_data['year'] = train_data['timestamp'].dt.year
    train_data['month'] = train_data['timestamp'].dt.month
    train_data['day'] = train_data['timestamp'].dt.day
    train_data['hour'] = train_data['timestamp'].dt.hour
    
    train_data.drop(['timestamp'], axis=1, inplace=True)

In [13]:
ConvertDate(train_data_electricity)
ConvertDate(train_data_chilledWater)
ConvertDate(train_data_steam)
ConvertDate(train_data_hotWater)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value i

In [14]:
train_data_electricity.head()

Unnamed: 0,building_id,meter,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,year,month,day,hour
0,0,0,0.0,0,Education,7432,2008.0,,2016,1,1,0
2301,0,0,0.0,0,Education,7432,2008.0,,2016,1,1,1
4594,0,0,0.0,0,Education,7432,2008.0,,2016,1,1,2
6893,0,0,0.0,0,Education,7432,2008.0,,2016,1,1,3
9189,0,0,0.0,0,Education,7432,2008.0,,2016,1,1,4


In [15]:
train_data_chilledWater.head()

Unnamed: 0,building_id,meter,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,year,month,day,hour
3172286,7,1,1857.26,0,Education,121074,1989.0,,2016,2,29,9
3174385,7,1,2659.25,0,Education,121074,1989.0,,2016,2,29,10
3176488,7,1,2912.51,0,Education,121074,1989.0,,2016,2,29,11
3178590,7,1,3039.15,0,Education,121074,1989.0,,2016,2,29,12
3180681,7,1,3207.99,0,Education,121074,1989.0,,2016,2,29,13


In [16]:
train_data_steam.head()

Unnamed: 0,building_id,meter,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,year,month,day,hour
894,745,2,0.0,6,Education,13512,,,2016,1,1,0
3193,745,2,0.0,6,Education,13512,,,2016,1,1,1
5488,745,2,0.0,6,Education,13512,,,2016,1,1,2
7786,745,2,0.0,6,Education,13512,,,2016,1,1,3
10081,745,2,0.0,6,Education,13512,,,2016,1,1,4


In [17]:
train_data_hotWater.head()

Unnamed: 0,building_id,meter,meter_reading,site_id,primary_use,square_feet,year_built,floor_count,year,month,day,hour
105,106,3,0.0,1,Education,5374,,4.0,2016,1,1,0
2406,106,3,10.0,1,Education,5374,,4.0,2016,1,1,1
4699,106,3,10.0,1,Education,5374,,4.0,2016,1,1,2
6998,106,3,10.0,1,Education,5374,,4.0,2016,1,1,3
9294,106,3,0.0,1,Education,5374,,4.0,2016,1,1,4


In [18]:
from sklearn.preprocessing import LabelEncoder

def CreateMeanMeterReading(train_data, buildings_number):
    new_columns = list(train_data.columns)
    new_columns.append('meter_reading_mean')

    train = pd.DataFrame(columns=new_columns)

    building_ids = train_data['building_id'].unique()[1:buildings_number]
    train_data_cutted = train_data[train_data['building_id'] == train_data['building_id'].unique()[0]]
    for building_id in building_ids:
        train_data_cutted = train_data_cutted.append(train_data[train_data['building_id'] == building_id], ignore_index=True)
    
    for building_id in train_data_cutted['building_id'].unique():
        buildingID = train_data_cutted[train_data_cutted['building_id'] == building_id]
        for month_id in buildingID['month'].unique():
            buildingIDmonthID = buildingID[buildingID['month'] == month_id]
            for day_id in buildingIDmonthID['day'].unique():
                buildingIDmonthIDdayID = buildingIDmonthID[buildingIDmonthID['day'] == day_id]
                train = train.append(buildingIDmonthIDdayID[0:1], ignore_index=True)
                train.set_value(train.shape[0]-1, 'meter_reading_mean', buildingIDmonthIDdayID['meter_reading'].mean())
    
    train = train.drop(['hour', 'year', 'building_id', 'floor_count', 'meter_reading', 'meter'], axis=1)
    train['primary_use'] = LabelEncoder().fit_transform(train['primary_use'])
    train = train.apply(pd.to_numeric)
    return train

In [19]:
train_electricity = CreateMeanMeterReading(train_data_electricity, 10)
train_chilledWater = CreateMeanMeterReading(train_data_chilledWater, 10)
train_steam = CreateMeanMeterReading(train_data_steam, 10)
train_hotWater = CreateMeanMeterReading(train_data_hotWater, 10)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort,


In [20]:
train_electricity.head()

Unnamed: 0,day,meter_reading_mean,month,primary_use,site_id,square_feet,year_built
0,1,0.0,1,0,0,7432,2008.0
1,2,0.0,1,0,0,7432,2008.0
2,3,0.0,1,0,0,7432,2008.0
3,4,0.0,1,0,0,7432,2008.0
4,5,0.0,1,0,0,7432,2008.0


In [21]:
train_chilledWater.head()

Unnamed: 0,day,meter_reading_mean,month,primary_use,site_id,square_feet,year_built
0,29,2600.1584,2,0,0,121074,1989.0
1,1,2230.113667,3,0,0,121074,1989.0
2,2,2462.269708,3,0,0,121074,1989.0
3,3,2212.52625,3,0,0,121074,1989.0
4,4,1953.987667,3,0,0,121074,1989.0


In [22]:
train_steam.head()

Unnamed: 0,day,meter_reading_mean,month,primary_use,site_id,square_feet,year_built
0,1,0.0,1,0,6,13512,
1,2,0.0,1,0,6,13512,
2,3,3.753125,1,0,6,13512,
3,4,226.378058,1,0,6,13512,
4,5,242.507912,1,0,6,13512,


In [23]:
train_hotWater.head()

Unnamed: 0,day,meter_reading_mean,month,primary_use,site_id,square_feet,year_built
0,1,11.25,1,0,1,5374,
1,2,11.25,1,0,1,5374,
2,3,7.5,1,0,1,5374,
3,4,12.083333,1,0,1,5374,
4,5,12.608696,1,0,1,5374,


### Statistics

#### TO DO:

* correlation of features between themselves
* correlation of features with target values
* draw histograms, barcharts, ...


* drop unnecessary columns or join some features
* drop data outliers (data.column.quantile)

Create powers of features

In [None]:
target = TR['meter_reading_mean']
features = TR.drop(['meter_reading_mean'], axis=1)

In [None]:
features.head()

In [None]:
target.head()

In [None]:
dict = {}

features_powered = features ** 2

# give columns appropriate names
data_2power = features_powered
data_2power['meter_reading_mean'] = target
correlation = data_2power.corr()
target_column = correlation['meter_reading_mean']
for i in range(target_column.shape[0]):
    key = target_column.index.values[i]
    value = target_column.iloc[i]
    dict[key] = value
correlation['meter_reading_mean'].head()
print(dict)

target_column.head()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
 
plt.figure(figsize=(20,10))
 
list_of_powered_features = []
features_corr_dict = {}
 
for i in range(1,5):
    new_columns = []
    features_powered = features ** i
    # give columns appropriate names
    for column in features_powered.columns:
        new_columns.append(column + '^{0}'.format(i))
    features_powered.columns = new_columns
 
    list_of_powered_features.append(features_powered)
    data_2power = features_powered
    data_2power['target'] = target
    correlation = data_2power.corr()
    # add values to the dictionary of feature correlation
    target_column = correlation['target']
    for j in range(target_column.shape[0]):
        key = target_column.index.values[j]
        if key != 'target':
            value = target_column.iloc[j]
            features_corr_dict[key] = value
 
    plt.subplot(2,2,i)
    ax = sns.heatmap(correlation)

In [None]:
sorted_features_corr_dict = sorted(features_corr_dict.items(), key=lambda kv: kv[1])

In [None]:
sorted_features_corr_dict.reverse()

#### Top 5 of correlating features

In [None]:
top5 = []

for i in range(1,6):
    feature_name = sorted_features_corr_dict[i][0]
    top5.append(feature_name)
    
print(top5)

In [None]:
top_corr_features = pd.DataFrame()
for feature in top5:
    index = int(feature[-1]) - 1
    feature_powered = list_of_powered_features[index]
    top_corr_features[feature] = feature_powered[feature]

In [None]:
top_corr_features

## Metrics

#### TO DO:

* check unbalancing!
* What metrics will we use and why?

Evaluation Metric
The evaluation metric for this competition is Root Mean Squared Logarithmic Error.

The RMSLE is calculated as $RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(\log \left(p_{i}+1\right)-\log \left(a_{i}+1\right)\right)^{2}}$

We have chosen RMSLE not to penalize huge differences in the predicted and the actual values when both predicted and true values are huge numbers (in this dataset they might go up to around 7 thousand kw/h). What is more, for this paticular problem overestimating meter readings is better than underestimating them since the goal is to find how much imporovements of buildings helped reduce their energy consumption. In this case again RMSLE is more appropriate than MSE.

In [None]:
# from sklearn.metrics import ...

## ML models

### Preprocessing

#### TO DO:

* split into train (80%) and test (val) (20%)
* OneHotEncoding for categorial features
* normalize (standardize) data

In [24]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_log_error


random_state = 1

In [25]:
def FeatureTargetSplit(data):
    target = data['meter_reading_mean']
    features = data.drop(['meter_reading_mean'], axis=1)
    
    return features, target

In [26]:
features_electricity, target_electricity = FeatureTargetSplit(train_electricity)
features_chilledWater, target_chilledWater = FeatureTargetSplit(train_chilledWater)
features_steam, target_steam = FeatureTargetSplit(train_steam)
features_hotWater, target_hotWater = FeatureTargetSplit(train_hotWater)

In [27]:
categorical_data_indices = [2, 3]
numerical_data_indices = [0, 1, 4, 5]

In [28]:
pipe_ridge = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        ('encoding', Pipeline(steps = [('selecting', FunctionTransformer(lambda X : X[:, categorical_data_indices])),
                                      ('encoding', OneHotEncoder())
                                      ])),
        ('scaling', Pipeline(steps = [('selecting', FunctionTransformer(lambda X : X[:, numerical_data_indices])),
                                     ('scaling', StandardScaler())
                                     ]))
    ])),
    ('model', Ridge(random_state=random_state))
])

In [29]:
pipe_lasso = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        ('encoding', Pipeline(steps = [('selecting', FunctionTransformer(lambda X : X[:, categorical_data_indices])),
                                      ('encoding', OneHotEncoder())
                                      ])),
        ('scaling', Pipeline(steps = [('selecting', FunctionTransformer(lambda X : X[:, numerical_data_indices])),
                                     ('scaling', StandardScaler())
                                     ]))
    ])),
    ('model', Lasso(random_state=random_state))
])

In [None]:
# X_train, X_val, y_train, y_val = train_test_split(features, target, test_size=0.2, random_state=random_state)

### Linear model

#### TO DO:

* Choose some linear models
* Find a good combination of hyperparameters via cross-validation
* plot dependency between score and some hyperparameter

In [None]:
# param_grid = {'model__alpha': np.logspace(-3, 3, 7, base=10)}
param_grid = {'model__alpha': [10**i for i in range(-3, 3)]}
grid_search_ridge = GridSearchCV(pipe_ridge, param_grid, cv=3, n_jobs=-1, scoring='neg_mean_squared_error') 
grid_search_ridge.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % -grid_search_ridge.best_score_)
print(grid_search_ridge.best_params_)

In [None]:
results_ridge = grid_search_ridge.cv_results_
plt.figure(figsize=[12., 9.])
plt.title("GridSearchCV for Ridge Regression")
plt.xlabel("Regularization coefficient")
plt.ylabel("MSE")

x_axis = np.array(param_grid['model__alpha'], dtype=float)
y_axis = -results_ridge['mean_test_score']
plt.plot(x_axis, y_axis)

In [None]:
grid_search_lasso = GridSearchCV(pipe_lasso, param_grid, cv=3, n_jobs=-1, scoring='neg_mean_squared_error') 
grid_search_lasso.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % -grid_search_lasso.best_score_)
print(grid_search_lasso.best_params_)

In [None]:
results_lasso = grid_search_lasso.cv_results_
plt.figure(figsize=[12., 9.])
plt.title("GridSearchCV for Lasso Regression")
plt.xlabel("Regularization coefficient")
plt.ylabel("MSE")

x_axis = np.array(param_grid['model__alpha'], dtype=float)
y_axis = -results_lasso['mean_test_score']
plt.plot(x_axis, y_axis)

In [None]:
from sklearn.metrics import mean_squared_error

if -grid_search_lasso.best_score_ < -grid_search_ridge.best_score_:
    fin_model = Lasso(alpha=grid_search_lasso.best_params_['model__alpha'], random_state=random_state)
else: 
    fin_model = Ridge(alpha = grid_search_ridge.best_params_['model__alpha'], random_state=random_state)

fin_model.fit(X_train, y_train)
y_pred = fin_model.predict(X_val)
mean_squared_error(y_pred, y_val)

In [30]:
def SuperGridSearch(features, target, param_grid, pipe):
    grid_search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1, scoring='neg_mean_squared_error') 
    grid_search.fit(features, target)
    return -grid_search_ridge.best_score_, grid_search

In [31]:
def BestModel(features, target, pipes):
    scores = []
    for name, pipe in pipes:
        score, grid = SuperGridSearch(features, target, param_grid, pipe_ridge)
        scores.append([score, name, grid])
    
    best_model = scores[0]
    for score, name, grid in scores:
        if score < best_model[0]:
            best_model = [score, name, grid]
    return best_model[1:]

In [32]:
def Predict(name, alpha, X_train, y_train, X_test):
    if (name == 'lasso'):
        model = Lasso(alpha=alpha, random_state=random_state)
    else:
        model = Ridge(alpha=alpha, random_state=random_state)
    model.fit(X_train, y_train)
    return model.predict(X_test)

In [33]:
def GetBuildingParameters(train_data):
    building_parameters = pd.DataFrame(columns=train_data.columns)

    for building_id in train_data['building_id'].unique():
        buildingID = train_data[train_data['building_id'] == building_id]
        building_parameters = building_parameters.append(buildingID[0:1], ignore_index=True)
    
    building_parameters = building_parameters.drop(['floor_count', 'timestamp', 'meter_reading', 'meter'], axis=1)
    building_parameters['primary_use'] = LabelEncoder().fit_transform(building_parameters['primary_use'])
    building_parameters = building_parameters.apply(pd.to_numeric)
    return building_parameters

In [34]:
building_parameters = GetBuildingParameters(train_data)
building_parameters.head()

Unnamed: 0,building_id,site_id,primary_use,square_feet,year_built
0,0,0,0,7432,2008.0
1,1,0,0,2720,2004.0
2,2,0,0,5376,1991.0
3,3,0,0,23685,2002.0
4,4,0,0,116607,1975.0


In [35]:
%%time
test_data = pd.read_csv('data' + os.sep + 'test.csv')

Wall time: 17.3 s


In [36]:
test_data = test_data.join(building_parameters.set_index('building_id'), on='building_id', how='inner')
test_data.head()

Unnamed: 0,row_id,building_id,meter,timestamp,site_id,primary_use,square_feet,year_built
0,0,0,0,2017-01-01 00:00:00,0,0,7432,2008.0
129,129,0,0,2017-01-01 01:00:00,0,0,7432,2008.0
258,258,0,0,2017-01-01 02:00:00,0,0,7432,2008.0
387,387,0,0,2017-01-01 03:00:00,0,0,7432,2008.0
516,516,0,0,2017-01-01 04:00:00,0,0,7432,2008.0


In [37]:
test_data_electricity = test_data[test_data['meter'] == 0]
test_data_chilledWater = test_data[test_data['meter'] == 1]
test_data_steam = test_data[test_data['meter'] == 2]
test_data_hotWater = test_data[test_data['meter'] == 3]

In [38]:
test_data_electricity['timestamp'] = pd.to_datetime(test_data_electricity['timestamp'])
test_data_chilledWater['timestamp'] = pd.to_datetime(test_data_chilledWater['timestamp'])
test_data_steam['timestamp'] = pd.to_datetime(test_data_steam['timestamp'])
test_data_hotWater['timestamp'] = pd.to_datetime(test_data_hotWater['timestamp'])

test_data_electricity['month'] = test_data_electricity['timestamp'].dt.month
test_data_electricity['day'] = test_data_electricity['timestamp'].dt.day
test_data_electricity.drop(['timestamp'], axis=1, inplace=True)

test_data_chilledWater['month'] = test_data_chilledWater['timestamp'].dt.month
test_data_chilledWater['day'] = test_data_chilledWater['timestamp'].dt.day
test_data_chilledWater.drop(['timestamp'], axis=1, inplace=True)

test_data_steam['month'] = test_data_steam['timestamp'].dt.month
test_data_steam['day'] = test_data_steam['timestamp'].dt.day
test_data_steam.drop(['timestamp'], axis=1, inplace=True)

test_data_hotWater['month'] = test_data_hotWater['timestamp'].dt.month
test_data_hotWater['day'] = test_data_hotWater['timestamp'].dt.day
test_data_hotWater.drop(['timestamp'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .l

In [39]:
test_data_electricity = test_data_electricity.drop(['meter', 'building_id'], axis=1)
test_data_chilledWater = test_data_chilledWater.drop(['meter', 'building_id'], axis=1)
test_data_steam = test_data_steam.drop(['meter', 'building_id'], axis=1)
test_data_hotWater = test_data_hotWater.drop(['meter', 'building_id'], axis=1)

In [40]:
test_data_electricity = test_data_electricity[['row_id', 'day', 'month', 'primary_use', 'site_id', 'square_feet', 'year_built']]
test_data_chilledWater = test_data_chilledWater[['row_id', 'day', 'month', 'primary_use', 'site_id', 'square_feet', 'year_built']]
test_data_steam = test_data_steam[['row_id', 'day', 'month', 'primary_use', 'site_id', 'square_feet', 'year_built']]
test_data_hotWater = test_data_hotWater[['row_id', 'day', 'month', 'primary_use', 'site_id', 'square_feet', 'year_built']]

In [41]:
param_grid = {'model__alpha': [10**i for i in range(-3, 3)]}

In [43]:
name, grid = BestModel(features_electricity, target_electricity, [['ridge', pipe_ridge], ['lasso', pipe_lasso]])
pred_electricity = Predict(name, grid.best_params_['model__alpha'], features, target, test_data_electricity.drop('row_id', axis=1))

ValueError: unknown categorical feature present [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2] during transform.

In [None]:
pred_electricity

## Analyzing of results

#### TO DO:

* Make `test.csv` files with our results
* send them into Kaggle in turn

### Comparison of models

#### TO DO:

* Compare results of Linear and Classifier models (take test scores from kaggle)
* Write conclusion