### Table of Contents
- [Introduction](#introduction)
- [1. Loading dataset and libraries](#loading)
- [2. Preprocessing](#preprocessing)
    - [2.1 Rename columns](#rename-columns)
    - [2.2 Data types](#data-types)
    - [2.3 Missing values](#missing-values)
    - [2.4 Duplicated data](#duplicated-data)
- [3. Statistical description](#statistical)
    - [3.1 Description](#description)
    - [3.2 Scatter Plots, Clustering and Data Exploration](#exploration)
    - [3.3 Correlation Heatmap](#correlation)
- [4. Models](#models)
    - [4.1 Standardization and splitting](#standardization)
    - [4.2 Model Zero](#model-0)
    - [4.3 Grid Search](#grid)

## Introduction <a name="introduction" />
### Danijel Sokolovic
### Index: 1392

### Energy efficiency
### Features
1. **X1 - A - Relative Compactness**\
The underlying quantity is the total surface area of the building envelope divided by its gross heated volume.\
The relative compactness (Rc) of a shape is obtained by comparing its volume-to-surface ratio with that of the most compact shape of the same volume.
2. X2 - B - Surface Area
3. X3 - C - Wall Area
4. X4 - D - Roof Area
5. X5 - E - Overall Height
6. X6 - F - Orientation
7. X7 - G - Glazing Area
8. X8 - H - Glazing Area Distribution
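
To make the relative compactness definition above concrete, here is a minimal sketch. It assumes the reference ("most compact") shape is a cube of equal volume, which the text does not state explicitly, so that Rc reduces to `6 * V**(2/3) / A`. The function names and the box dimensions are made up for illustration.

```python
def box_volume_and_surface(a: float, b: float, c: float) -> tuple:
    """Volume and total surface area of an a x b x c rectangular box."""
    volume = a * b * c
    surface = 2.0 * (a * b + b * c + a * c)
    return volume, surface


def relative_compactness(volume: float, surface_area: float) -> float:
    """Rc relative to a cube of equal volume (assumed reference shape)."""
    # A cube with volume V has side V**(1/3) and surface area 6 * V**(2/3).
    cube_surface = 6.0 * volume ** (2.0 / 3.0)
    return cube_surface / surface_area


# A cube is its own reference shape, so its Rc is 1 (up to rounding).
rc_cube = relative_compactness(*box_volume_and_surface(2.0, 2.0, 2.0))

# An elongated box with the same volume has more surface, hence lower Rc.
rc_elongated = relative_compactness(*box_volume_and_surface(1.0, 1.0, 8.0))

print(round(rc_cube, 3), round(rc_elongated, 3))
```

Under this assumption a lower Rc means more envelope surface for the same heated volume.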

### Output (target values)
- y1 Heating Load
- y2 Cooling Load

## 1. Loading dataset and libraries <a name="loading" />

In [123]:
import collections
from datetime import datetime
from pprint import pprint

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from numpy import absolute, mean, std
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    RandomizedSearchCV,
    RepeatedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from tensorflow import keras
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD
# These scikit-learn wrappers ship only with older TensorFlow 2.x releases.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor


df = pd.read_csv("en_eff.csv")
df.head()

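| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | Y1 | Y2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 2 | 0.0 | 0 | 15.55 | 21.33 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3 | 0.0 | 0 | 15.55 | 21.33 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 4 | 0.0 | 0 | 15.55 | 21.33 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5 | 0.0 | 0 | 15.55 | 21.33 |
| 4 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 2 | 0.0 | 0 | 20.84 | 28.28 |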


In [124]:
df.shape

(768, 10)

The dataset has 768 rows and 10 columns (8 features and 2 targets), which matches the dataset description.

## 2. Preprocessing <a name="preprocessing"/>

### 2.1 Rename columns  <a name="rename-columns"/>

In [125]:
df.columns = ['relative_compactness', 
              'surface_area', 
              'wall_area', 
              'roof_area', 
              'overall_height', 
              'orientation', 
              'glazing_area', 
              'glazing_distribution', 
              'heating_load', 
              'cooling_load']
df.head(20)

| | relative_compactness | surface_area | wall_area | roof_area | overall_height | orientation | glazing_area | glazing_distribution | heating_load | cooling_load |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 2 | 0.0 | 0 | 15.55 | 21.33 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3 | 0.0 | 0 | 15.55 | 21.33 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 4 | 0.0 | 0 | 15.55 | 21.33 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5 | 0.0 | 0 | 15.55 | 21.33 |
| 4 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 2 | 0.0 | 0 | 20.84 | 28.28 |
| 5 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 3 | 0.0 | 0 | 21.46 | 25.38 |
| 6 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 4 | 0.0 | 0 | 20.71 | 25.16 |
| 7 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 5 | 0.0 | 0 | 19.68 | 29.6 |
| 8 | 0.86 | 588.0 | 294.0 | 147.0 | 7.0 | 2 | 0.0 | 0 | 19.5 | 27.3 |
| 9 | 0.86 | 588.0 | 294.0 | 147.0 | 7.0 | 3 | 0.0 | 0 | 19.95 | 21.97 |


### 2.2 Data types  <a name="data-types"/>

In [126]:
df.dtypes

relative_compactness    float64
surface_area            float64
wall_area               float64
roof_area               float64
overall_height          float64
orientation               int64
glazing_area            float64
glazing_distribution      int64
heating_load            float64
cooling_load            float64
dtype: object

This confirms that every column is numeric: the values are either floating-point numbers or integers.

### 2.3 Missing values  <a name="missing-values"/>

In [127]:
df.isnull().any()

relative_compactness    False
surface_area            False
wall_area               False
roof_area               False
overall_height          False
orientation             False
glazing_area            False
glazing_distribution    False
heating_load            False
cooling_load            False
dtype: bool

We check whether any column contains NaN values. None do.
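
`isnull().any()` only says whether a column has missing values. As a small sketch, counting them per column and listing the affected rows gives more detail when something is missing. The toy frame below uses made-up values; only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and two deliberately missing cells.
toy = pd.DataFrame({
    "wall_area": [294.0, np.nan, 318.5],
    "overall_height": [7.0, 7.0, 3.5],
    "heating_load": [15.55, 20.84, np.nan],
})

# Number of NaN cells in each column.
missing_per_column = toy.isnull().sum()

# Total number of NaN cells in the frame.
total_missing = int(missing_per_column.sum())

# Rows that contain at least one NaN.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(total_missing)
print(rows_with_missing)
```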

In [128]:
# Keep only relative_compactness, wall_area, overall_height and the heating_load target.
df = df.drop(columns=[
    'orientation',
    'surface_area',
    'roof_area',
    'glazing_area',
    'glazing_distribution',
    'cooling_load',
])

### 2.4 Duplicated data  <a name="duplicated-data"/>

In [129]:
duplicates = df.duplicated().sum()
print(duplicates)
df = df.drop_duplicates()
duplicates = df.duplicated().sum()
print(duplicates)

111
0


We check for duplicated rows. With the reduced set of columns, 111 rows are duplicates; after `drop_duplicates()` none remain, which leaves 657 rows.
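
One plausible reason duplicates show up at this point is that rows which differed only in a dropped column become identical once that column is gone. The sketch below illustrates the effect on a toy frame with made-up rows; it is not a re-run of the notebook.

```python
import pandas as pd

# Made-up rows: the first two differ only in `orientation`.
toy = pd.DataFrame({
    "relative_compactness": [0.98, 0.98, 0.90],
    "orientation": [2, 3, 2],
    "heating_load": [15.55, 15.55, 20.84],
})

# With all columns present every row is unique.
dupes_before = int(toy.duplicated().sum())

# After dropping `orientation` the first two rows collapse into duplicates.
reduced = toy.drop(columns=["orientation"])
dupes_after = int(reduced.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row.
deduplicated = reduced.drop_duplicates().reset_index(drop=True)

print(dupes_before, dupes_after, len(deduplicated))
```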

## 3. Statistical Description  <a name="statistical"/>

### 3.1 Description <a name="description"/>

In [130]:
df.describe()

| | relative_compactness | wall_area | overall_height | heating_load |
|---|---|---|---|---|
| count | 657.0 | 657.0 | 657.0 | 657.0 |
| mean | 0.768706 | 319.618721 | 5.369863 | 23.050457 |
| std | 0.104342 | 43.904427 | 1.74722 | 10.137297 |
| min | 0.62 | 245.0 | 3.5 | 6.01 |
| 25% | 0.69 | 294.0 | 3.5 | 13.78 |
| 50% | 0.76 | 318.5 | 7.0 | 23.75 |
| 75% | 0.86 | 343.0 | 7.0 | 32.15 |
| max | 0.98 | 416.5 | 7.0 | 43.1 |


## 4. Models <a name="models"/>

### 4.1 Standardization and splitting <a name="standardization"/>

In [131]:
X = df[[    
    'relative_compactness', 
    #'surface_area', 
    'wall_area', 
    #'roof_area', 
    'overall_height', 
    #'orientation', 
    #'glazing_area',
    #'glazing_distribution'
]]
y = df[['heating_load']]

In [132]:
# StandardScaler methods:
#    fit()               => Compute the mean and std to be used for later scaling.
#    transform()         => Perform standardization by centering and scaling.
#    fit_transform()     => Fit to data, then transform it.
#    inverse_transform() => Scale back the data to the original representation.

# Use one scaler per array: re-fitting a single scaler on y would
# overwrite the mean/std it learned from X.
sc_X = StandardScaler()   # scaler for the predictors
sc = StandardScaler()     # scaler for the target
X = sc_X.fit_transform(X)
y = sc.fit_transform(y)
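
Since the target is standardized as well, predictions come out in standardized units and need `inverse_transform()` to be read as heating loads. The sketch below shows the round trip with separate scalers for predictors and target; the array values and the names `x_scaler` and `y_scaler` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up predictor rows (3 features) and target values.
X_toy = np.array([[0.98, 294.0, 7.0],
                  [0.90, 318.5, 7.0],
                  [0.62, 367.5, 3.5]])
y_toy = np.array([[15.55], [20.84], [12.00]])

# Each scaler stores the mean and std of the array it was fitted on.
x_scaler = StandardScaler().fit(X_toy)
y_scaler = StandardScaler().fit(y_toy)

X_scaled = x_scaler.transform(X_toy)
y_scaled = y_scaler.transform(y_toy)

# Map standardized targets (or predictions) back to original units.
y_restored = y_scaler.inverse_transform(y_scaled)

print(np.allclose(y_restored, y_toy))
```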

In [133]:
# Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
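
The notebook standardizes the full dataset before splitting, so the test rows contribute to the scaling statistics. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training part only. The random arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 predictors, 1 target.
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=300.0, scale=40.0, size=(100, 3))
y_raw = rng.normal(loc=23.0, scale=10.0, size=(100, 1))

# Split first, using the same proportions as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42
)

# Fit the scaler on training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training columns have zero mean exactly, while the test columns are only approximately centered, which is expected.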

In [134]:
#print("Full dataset X: {}".format(X.shape))
#print("Full dataset y: {}".format(y.shape))
#print("Training set X_train: {}".format(X_train.shape))
#print("Test set X_test: {}".format(X_test.shape))
#print("Training set y_train: {}".format(y_train.shape))
#print("Test set y_test: {}".format(y_test.shape))

### The following topics are explained next
- batch size & epochs
- optimization algorithms
- learning rate & momentum
- network weight initialization
- neuron activation function
- dropout regularization
- number of neurons in hidden layers
- number of hidden layers
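
Tuning several of these settings at once makes a grid search grow quickly, because every combination is trained once per cross-validation fold. The sketch below counts the combinations for a hypothetical search space; the keys and candidate values are assumptions chosen for illustration, not the grid used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space covering some of the topics listed above.
search_space = {
    "batch_size": [5, 10],
    "epochs": [10, 50, 100],
    "optimizer": ["sgd", "adam"],
    "activation": ["relu", "tanh"],
    "dropout_rate": [0.0, 0.2],
    "hidden_units": [5, 10],
    "hidden_layers": [1, 2],
}

# ParameterGrid enumerates the Cartesian product of all value lists.
n_candidates = len(ParameterGrid(search_space))

# With k-fold cross-validation each candidate is trained k times.
cv_folds = 3
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

This is one reason to tune a few hyperparameters at a time, or to switch to a randomized search when the space gets large.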

In [135]:
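def base_model_one():
    # One hidden layer (10 ReLU units) on the 3 input features, linear output.
    model = Sequential()
    model.add(Dense(units=10, input_dim=3, activation='relu'))
    model.add(Dense(units=1, kernel_initializer='normal'))
    model.compile(loss='mse', optimizer='adam', metrics=['mse'])
    return model

model = KerasRegressor(build_fn=base_model_one, verbose=1)
# NOTE: tf.keras `fit` names this argument `epochs`; `nb_epoch` is the legacy
# Keras 1 spelling and may not be honored by the wrapper.
param_grid = {
    'batch_size': [5, 10],
    'nb_epoch': [10, 50, 100]
}
# 3-fold cross-validated search; note it is fit on the full X, y, not on X_train, y_train.
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=3)
grid_result = grid.fit(X, y)
print("Best: {}, using {}.".format(grid_result.best_score_, grid_result.best_params_))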

Best: -0.6609626412391663, using {'batch_size': 5, 'nb_epoch': 100}.
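
The negative best score is probably a negated loss rather than a sign of a broken model. scikit-learn's grid search always maximizes the score, so error metrics are reported with a flipped sign. Assuming the Keras wrapper follows the same convention, -0.66 would correspond to a mean squared error of about 0.66 on the standardized target. The sketch below shows the convention with a `Ridge` model standing in for the network, on synthetic data; both are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: 3 predictors, linear target plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# "neg_mean_squared_error" returns -MSE so that larger is better.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_demo, y_demo)

# Flip the sign to read the best score as an ordinary MSE.
best_mse = -search.best_score_

print(search.best_params_, search.best_score_, best_mse)
```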
