### Problem statement
Build a linear regression model of a building's energy consumption using air temperature (air_temperature) and dew point temperature (dew_temperature), which serves here as the humidity indicator.

Evaluate the quality of the fitted model on held-out test data.

Data:
* http://video.ittensive.com/machine-learning/ashrae/building_metadata.csv.gz
* http://video.ittensive.com/machine-learning/ashrae/weather_train.csv.gz
* http://video.ittensive.com/machine-learning/ashrae/train.0.0.csv.gz

Competition: https://www.kaggle.com/c/ashrae-energy-prediction/

© ITtensive, 2020

### Importing libraries

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

### Loading the data

In [2]:
buildings = pd.read_csv("http://video.ittensive.com/machine-learning/ashrae/building_metadata.csv.gz")
weather = pd.read_csv("http://video.ittensive.com/machine-learning/ashrae/weather_train.csv.gz")
energy_0 = pd.read_csv("http://video.ittensive.com/machine-learning/ashrae/train.0.0.csv.gz")
energy_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8784 entries, 0 to 8783
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   building_id    8784 non-null   int64  
 1   meter          8784 non-null   int64  
 2   timestamp      8784 non-null   object 
 3   meter_reading  8784 non-null   float64
dtypes: float64(1), int64(2), object(1)
memory usage: 274.6+ KB


### Merging and filtering the data

In [3]:
# Attach building metadata (site_id, primary_use, ...) to each meter reading.
energy_0 = pd.merge(left=energy_0, right=buildings, how="left",
                   left_on="building_id", right_on="building_id")
# Join weather on (timestamp, site_id): weather is recorded per site, not per building.
energy_0.set_index(["timestamp", "site_id"], inplace=True)
weather.set_index(["timestamp", "site_id"], inplace=True)
energy_0 = pd.merge(left=energy_0, right=weather, how="left",
                   left_index=True, right_index=True)
energy_0.reset_index(inplace=True)
# Keep only rows with non-zero consumption; .copy() avoids SettingWithCopyWarning below.
energy_0 = energy_0[energy_0["meter_reading"] > 0].copy()
energy_0["timestamp"] = pd.to_datetime(energy_0["timestamp"])
energy_0["hour"] = energy_0["timestamp"].dt.hour
print(energy_0.head())

               timestamp  site_id  building_id  meter  meter_reading  \
704  2016-01-30 08:00:00        0            0      0        43.6839   
725  2016-01-31 05:00:00        0            0      0        37.5408   
737  2016-01-31 17:00:00        0            0      0        52.5571   
2366 2016-04-08 14:00:00        0            0      0        59.3827   
2923 2016-05-01 19:00:00        0            0      0       448.0000   

     primary_use  square_feet  year_built  floor_count  air_temperature  \
704    Education         7432      2008.0          NaN              8.3   
725    Education         7432      2008.0          NaN             12.8   
737    Education         7432      2008.0          NaN             20.6   
2366   Education         7432      2008.0          NaN             21.7   
2923   Education         7432      2008.0          NaN             31.1   

      cloud_coverage  dew_temperature  precip_depth_1_hr  sea_level_pressure  \
704              NaN              6.

Load the data and fill in the missing values (with zeros and means). Build a linear regression model for each hour separately, using air temperature (air_temperature), dew point temperature (dew_temperature), sea-level pressure (sea_level_pressure), wind speed (wind_speed), and cloud coverage (cloud_coverage).
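
One way to read "a model for each hour" is to fit 24 independent regressions, one per value of the `hour` column. The sketch below does this on synthetic data and assumes that `hour` is kept in the frame alongside the features. The helper `fit_hourly_models` and the generated `demo` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["air_temperature", "dew_temperature", "sea_level_pressure",
            "wind_speed", "cloud_coverage"]

def fit_hourly_models(df, features=FEATURES, target="meter_reading"):
    """Fit one LinearRegression per hour of day and return {hour: model}."""
    models = {}
    for hour, group in df.groupby("hour"):
        models[hour] = LinearRegression().fit(group[features], group[target])
    return models

# Synthetic stand-in for the merged frame (all values are made up).
rng = np.random.default_rng(0)
n = 24 * 50
demo = pd.DataFrame({
    "hour": np.tile(np.arange(24), 50),
    "air_temperature": rng.normal(20, 5, n),
    "dew_temperature": rng.normal(15, 4, n),
    "sea_level_pressure": rng.normal(1015, 5, n),
    "wind_speed": rng.uniform(0, 8, n),
    "cloud_coverage": rng.integers(0, 9, n).astype(float),
})
demo["meter_reading"] = 200 + 2 * demo["air_temperature"] + rng.normal(0, 5, n)

hourly_models = fit_hourly_models(demo)
print(len(hourly_models))  # one model per distinct hour
```

To predict with such a dictionary, each test row would be routed to the model stored under its own hour.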

### Adding the hour to the data

In [4]:
energy_0["timestamp"] = pd.to_datetime(energy_0["timestamp"])
energy_0["hour"] = energy_0["timestamp"].dt.hour

In [5]:
energy_0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5411 entries, 704 to 8783
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   timestamp           5411 non-null   datetime64[ns]
 1   site_id             5411 non-null   int64         
 2   building_id         5411 non-null   int64         
 3   meter               5411 non-null   int64         
 4   meter_reading       5411 non-null   float64       
 5   primary_use         5411 non-null   object        
 6   square_feet         5411 non-null   int64         
 7   year_built          5411 non-null   float64       
 8   floor_count         0 non-null      float64       
 9   air_temperature     5411 non-null   float64       
 10  cloud_coverage      3150 non-null   float64       
 11  dew_temperature     5411 non-null   float64       
 12  precip_depth_1_hr   5411 non-null   float64       
 13  sea_level_pressure  5383 non-null   float64   

In [6]:
# Keep only the target and the five weather features (this drops "hour" and all other columns).
energy_0 = pd.DataFrame(energy_0, columns=["meter_reading", "air_temperature", "dew_temperature",
                                           "sea_level_pressure", "wind_speed", "cloud_coverage"])
energy_0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5411 entries, 704 to 8783
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   meter_reading       5411 non-null   float64
 1   air_temperature     5411 non-null   float64
 2   dew_temperature     5411 non-null   float64
 3   sea_level_pressure  5383 non-null   float64
 4   wind_speed          5411 non-null   float64
 5   cloud_coverage      3150 non-null   float64
dtypes: float64(6)
memory usage: 295.9 KB


In [7]:
energy_0.fillna(energy_0.mean(), inplace=True)
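
As a small illustration of what this call does, `fillna` with a Series of column means fills each column's gaps with that column's own mean. The frame `toy` below is made up. The second variant, which uses zeros for one column and means for the rest, is only an assumption about how the "zeros and means" instruction might be applied.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (values are illustrative only).
toy = pd.DataFrame({
    "sea_level_pressure": [1010.0, np.nan, 1020.0],
    "cloud_coverage": [2.0, np.nan, 6.0],
})

# Column-wise mean imputation, as in the cell above.
filled = toy.fillna(toy.mean())

# Variant: zeros for cloud_coverage, means elsewhere (the column choice is an assumption).
mixed = toy.fillna({"cloud_coverage": 0.0}).fillna(toy.mean())

print(filled)
print(mixed)
```

Note that the means here are computed over the whole frame, before the train/test split, so the test rows contribute to the fill values.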

In [8]:
energy_0.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5411 entries, 704 to 8783
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   meter_reading       5411 non-null   float64
 1   air_temperature     5411 non-null   float64
 2   dew_temperature     5411 non-null   float64
 3   sea_level_pressure  5411 non-null   float64
 4   wind_speed          5411 non-null   float64
 5   cloud_coverage      5411 non-null   float64
dtypes: float64(6)
memory usage: 295.9 KB


### Splitting the data into training and test sets

In [9]:
energy_0_train, energy_0_test = train_test_split(energy_0, test_size=0.2)
print (energy_0_train.head())

      meter_reading  air_temperature  dew_temperature  sea_level_pressure  \
8670        240.944             19.4             18.3              1025.5   
4974        268.929             25.6             22.2              1018.8   
6600        245.039             27.8             21.1              1016.0   
7093        245.722             16.7              7.8              1015.9   
8139        230.023             20.6             17.8              1017.9   

      wind_speed  cloud_coverage  
8670         1.5             2.0  
4974         0.0             2.0  
6600         4.1             6.0  
7093         4.1             0.0  
8139         2.1             0.0  
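
Because `train_test_split` shuffles the rows randomly, the rows shown above, and every number computed after them, will differ from run to run. A minimal sketch of a reproducible split, assuming a fixed `random_state` is acceptable; the seed value 42 and the `toy` frame are arbitrary choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for energy_0.
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# The same seed yields the same partition on every call.
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(train_a.index.equals(train_b.index))  # True
```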


### Linear regression model

In [10]:
y = energy_0_train["meter_reading"]
x = energy_0_train.drop(labels=["meter_reading"], axis=1)
model = LinearRegression().fit(x, y)
print (model.coef_, model.intercept_)

[ 2.73292979  3.63255151 -0.77748459 -2.35893162 -3.2176081 ] 907.542400682675


### Model evaluation

In [11]:
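y_test = energy_0_test["meter_reading"]
X_test = energy_0_test.drop(labels=["meter_reading"], axis=1)
predict_test = model.predict(X_test)

# Root mean squared logarithmic error; np.sqrt replaces the `squared=False` argument,
# which was deprecated in later scikit-learn releases.
RMSLE = np.sqrt(metrics.mean_squared_log_error(y_true=y_test, y_pred=predict_test))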

In [12]:
print(f'RMSLE: {RMSLE}')

RMSLE: 0.19786893362148153
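
For reference, RMSLE is the square root of the mean squared difference between log(1 + prediction) and log(1 + actual value). The sketch below computes it directly with NumPy. The helper name `rmsle` and the sample values are hypothetical, and the explicit check for negative inputs is an added safeguard, since a linear model can in principle predict negative consumption.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # The metric is not meaningful for negative values, so reject them explicitly.
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE is undefined for negative values")
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

print(rmsle([100.0, 200.0], [100.0, 200.0]))  # 0.0 for a perfect prediction
```

Since the metric works on logarithms, it penalizes relative rather than absolute errors.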
