### Comparing multiple classification algorithms with hyperparameter tuning

#### Building energy efficiency comparison

A study investigates the energy efficiency of residential buildings, in particular the heating and cooling requirements, as a function of architectural characteristics such as wall area, glass area, orientation, etc.

The dataset used contains eight attributes describing these characteristics for 768 buildings and two target attributes: the heating and cooling loads of these buildings.

The objective of the exercise is to predict the loads for each building, based on the first eight attributes.

The dataset is to be read from the file "ENB_data.csv". Note that the columns are separated by ';'.
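
As a minimal sketch of reading a semicolon-separated file, the snippet below parses a small in-memory sample instead of the real file. The two rows and the column names `X1` to `X8`, `Y1`, `Y2` are copied from this notebook; the variable names `sample` and `df_sample` are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row sample shaped like the dataset described above
# (column names X1..X8, Y1, Y2 follow the rename step used in this notebook).
sample = io.StringIO(
    "X1;X2;X3;X4;X5;X6;X7;X8;Y1;Y2\n"
    "0.98;514.5;294.0;110.25;7.0;2;0.0;0;15.55;21.33\n"
    "0.90;563.5;318.5;122.5;7.0;2;0.0;0;20.84;28.28\n"
)

# sep=';' makes pandas split on semicolons instead of the default comma
df_sample = pd.read_csv(sample, sep=";")

# First audit: dimensions and inferred column types
print(df_sample.shape)
print(df_sample.dtypes)
```

If the separator is wrong, pandas typically returns a single column holding the whole line, so checking `shape` right after loading is a cheap sanity test.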

In [5]:
# Load the data file and perform a first audit of the data in a DataFrame df.
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn import ensemble
from sklearn import svm
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import VotingClassifier

df = pd.read_csv("/Users/shahul/Desktop/ENB2012_data.csv")
df.head()

df.rename(columns = {'X1':'Relative Compactness','X2':'Surface Area','X3':'Wall Area',
                     'X4':'Roof Area','X5':'Overall Height','X6':'Orientation',
                     'X7':'Glazing Area','X8':'Glazing Area Distribution','Y1':'heating_load',
                     'Y2':'cooling_load'}, inplace= True)
df.head()

,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Orientation,Glazing Area,Glazing Area Distribution,heating_load,cooling_load
0,0.98,514.5,294.0,110.25,7.0,2,0.0,0,15.55,21.33
1,0.98,514.5,294.0,110.25,7.0,3,0.0,0,15.55,21.33
2,0.98,514.5,294.0,110.25,7.0,4,0.0,0,15.55,21.33
3,0.98,514.5,294.0,110.25,7.0,5,0.0,0,15.55,21.33
4,0.9,563.5,318.5,122.5,7.0,2,0.0,0,20.84,28.28


Analyze the correlations between all variables in df.
Which explanatory variables are most strongly correlated with the two target variables?

In [2]:
corr = df.corr()
# Styler.format(precision=...) replaces set_precision, which was removed in pandas 2.0
corr.style.background_gradient(cmap='coolwarm').format(precision=2)


,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Orientation,Glazing Area,Glazing Area Distribution,Heating_Load,Cooling_Load
Relative Compactness,1.0,-0.99,-0.2,-0.87,0.83,0.0,0.0,0.0,0.62,0.63
Surface Area,-0.99,1.0,0.2,0.88,-0.86,0.0,0.0,-0.0,-0.66,-0.67
Wall Area,-0.2,0.2,1.0,-0.29,0.28,0.0,-0.0,0.0,0.46,0.43
Roof Area,-0.87,0.88,-0.29,1.0,-0.97,0.0,-0.0,-0.0,-0.86,-0.86
Overall Height,0.83,-0.86,0.28,-0.97,1.0,0.0,0.0,0.0,0.89,0.9
Orientation,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-0.0,0.01
Glazing Area,0.0,0.0,-0.0,-0.0,0.0,0.0,1.0,0.21,0.27,0.21
Glazing Area Distribution,0.0,-0.0,0.0,-0.0,0.0,0.0,0.21,1.0,0.09,0.05
Heating_Load,0.62,-0.66,0.46,-0.86,0.89,-0.0,0.27,0.09,1.0,0.98
Cooling_Load,0.63,-0.67,0.43,-0.86,0.9,0.01,0.21,0.05,0.98,1.0


In [3]:
# The explanatory variables most correlated (in absolute value) with the 2 target variables are, in order:
# Overall Height, Roof Area, Surface Area and Relative Compactness.
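
The ranking can also be read off programmatically rather than by eye. The sketch below is an illustration on synthetic data, not the ENB dataset: the frame `toy`, its coefficients and the helper names `corr_to_targets` and `ranking` are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (assumption: not the real ENB data)
rng = np.random.default_rng(0)
n = 200
height = rng.choice([3.5, 7.0], size=n)
orientation = rng.integers(2, 6, size=n)
glazing = rng.choice([0.0, 0.1, 0.25, 0.4], size=n)

toy = pd.DataFrame({
    "Overall Height": height,
    "Orientation": orientation,
    "Glazing Area": glazing,
    "heating_load": 5 * height + 10 * glazing + rng.normal(0, 1, n),
})
toy["cooling_load"] = toy["heating_load"] + rng.normal(0, 1, n)

targets = ["heating_load", "cooling_load"]

# Keep only the target columns of the correlation matrix and drop the
# rows that correspond to the targets themselves
corr_to_targets = toy.corr()[targets].drop(index=targets)

# Rank features by mean absolute correlation with the two targets
ranking = corr_to_targets.abs().mean(axis=1).sort_values(ascending=False)
print(ranking)
```

Using the absolute value matters here: a strongly negative correlation, such as Roof Area in the table above, is as informative as a strongly positive one.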




The next step is to build an optimal classification model after grouping the buildings into classes based on their total energy load (heating + cooling).

Create a new column in df, called total_charges, holding the sum of the heating and cooling loads for each building.
In a new column charges_classes, split the buildings into 4 distinct classes with labels 0, 1, 2, 3 according to the 3 quartiles of this new variable.
The quantiles of a variable can be obtained with the describe method of a pandas Series or with its quantile method.



In [16]:
# Total load = heating load + cooling load
df['total_charges'] = df.heating_load + df.cooling_load
# Split the buildings into 4 classes at the three quartiles of the total load
df["charges_classes"] = pd.qcut(df.total_charges, 4, labels=["0", "1", "2", "3"])
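
To make the quartile binning concrete, the following sketch compares `pd.qcut` with a manual split at the three quartiles. It runs on random values rather than the real loads; `total`, `classes` and `manual` are illustrative names, and integer labels are used here only to simplify the comparison.

```python
import numpy as np
import pandas as pd

# Synthetic total loads (assumption: uniform random values, not the real data)
rng = np.random.default_rng(1)
total = pd.Series(rng.uniform(20, 90, size=768), name="total")

# The three quartiles are the cut points between the four classes
q1, q2, q3 = total.quantile([0.25, 0.5, 0.75])

# qcut bins at those quartiles and attaches the labels 0..3
classes = pd.qcut(total, 4, labels=[0, 1, 2, 3])
print(classes.value_counts().sort_index())

# Equivalent manual binning: class i means the value lies in the i-th
# quartile interval, with each interval closed on the right
manual = np.digitize(total.to_numpy(), [q1, q2, q3], right=True)
print((classes.astype(int).to_numpy() == manual).all())
```

Because the cut points are quantiles, the four classes come out balanced, which keeps plain accuracy a reasonable metric for the comparison that follows.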


Store the explanatory data only in a variable data.
Split the data into a training set and a test set (20%), with data as the explanatory data and charges_classes as the target variable.
Standardize (center and scale) the explanatory variables in both sets, fitting the scaler on the training set only.

In [17]:
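# Explanatory data: the eight building attributes only. The load columns and
# the derived total/class columns are excluded so they cannot leak the target.
data = df.drop(columns=['heating_load', 'cooling_load', 'total_charges', 'charges_classes'])
target = df.charges_classes  # target data

# Separate training and test data (20% test)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Fit the scaler on the training set only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)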
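
One possible alternative to scaling by hand is to wrap the scaler and the model in a scikit-learn `Pipeline`, so that the scaler is always fitted on the training portion only (including inside each cross-validation fold). This is a sketch on synthetic data from `make_classification`; the dataset, the choice of KNN and the variable names are assumptions, not the notebook's own setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem with 8 features (assumption: stand-in data)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)

# The pipeline fits the scaler on X_tr only, then trains KNN on the scaled data
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)

# After fitting, the training data has mean 0 and standard deviation 1 per column
scaler_step = model.named_steps["standardscaler"]
X_tr_scaled = scaler_step.transform(X_tr)
print(np.round(X_tr_scaled.mean(axis=0), 6))
print(np.round(X_tr_scaled.std(axis=0), 6))

# The test set is scaled with the training statistics before prediction
print(model.score(X_te, y_te))
```

The test set will generally not have exactly zero mean and unit variance after this transformation, which is expected: it is scaled with statistics learned from the training set.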


In the following, we compare several learning methods. For each of them, we explore the following hyperparameter ranges:

K-nearest neighbors. Hyperparameter to tune: 'n_neighbors': 2 to 50.

SVM. Hyperparameters to tune: 'kernel': 'rbf', 'linear'.
                              'C': 0.1, 1, 10, 50.

RandomForest. Hyperparameters to tune: 'max_features': 'sqrt', 'log2', None.
                                       'min_samples_split': even numbers from 2 to 30.

For each algorithm mentioned above:

Select the hyperparameters by cross-validation on the training set and display the selected values.
Apply the model to the test set, then display the confusion matrix and the model score on the test set.

Finally, we evaluate which model gives the best accuracy.
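
The two steps above are the same for every algorithm, so they can be sketched as one reusable helper. The function name `tune_and_evaluate`, the synthetic data and the reduced `n_neighbors` range are assumptions made to keep the example short; the cells below perform the steps explicitly for each model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def tune_and_evaluate(estimator, param_grid, X_train, y_train, X_test, y_test, cv=5):
    """Hypothetical helper: grid search on the training set, then test-set evaluation."""
    search = GridSearchCV(estimator, param_grid, cv=cv)
    search.fit(X_train, y_train)  # refits the best model on the full training set
    y_pred = search.predict(X_test)
    labels = search.classes_
    cm = pd.DataFrame(confusion_matrix(y_test, y_pred, labels=labels),
                      index=labels, columns=labels)
    return search.best_params_, search.best_score_, cm, search.score(X_test, y_test)


# Synthetic 4-class data (assumption: stand-in for the building dataset)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_params, cv_score, cm, test_score = tune_and_evaluate(
    KNeighborsClassifier(), {"n_neighbors": list(range(2, 11))},
    X_tr, y_tr, X_te, y_te)

print(best_params)
print(round(cv_score, 3), round(test_score, 3))
print(cm)
```

Keeping the cross-validated score and the test score separate makes it easier to spot a model that was tuned well on the training folds but generalizes less well.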

K-nearest neighbors

In [26]:
from sklearn.model_selection import GridSearchCV

# Create a new KNN model
knn = neighbors.KNeighborsClassifier()

# Dictionary of all values to test for n_neighbors
param_grid = {'n_neighbors': np.arange(2, 51)}

# Use grid search with 5-fold cross-validation to test all values of n_neighbors
knn_gscv = GridSearchCV(knn, param_grid, cv=5)

# Fit the grid search on the training data
knn_gscv.fit(X_train, y_train)

# Best-performing n_neighbors value
knn_gscv.best_params_  # n_neighbors = 3 was selected

# Mean cross-validated score for the best n_neighbors value
knn_gscv.best_score_  # about 97%

0.9690523790483805

In [35]:
# Retrain with the best n_neighbors value
knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

# Confusion matrix

pd.crosstab(y_test, y_pred, rownames=['Actual class'], colnames=['Predicted class'])



Predicted class,0,1,2,3
Actual class,,,,
0,38,2,0,0
1,1,34,2,0
2,0,1,40,0
3,0,0,0,36


In [37]:
# Model score (accuracy) on the test set
score_knn = knn.score(X_test, y_test)
print(score_knn)

0.961038961038961
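
The score can be checked by hand against the KNN confusion matrix shown above: accuracy is the sum of the diagonal divided by the total count. The sketch below copies the matrix values from that output; the per-class recall and precision are an added illustration, and the variable names are assumptions.

```python
import numpy as np

# KNN confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
cm = np.array([
    [38, 2, 0, 0],
    [1, 34, 2, 0],
    [0, 1, 40, 0],
    [0, 0, 0, 36],
])

# Accuracy: correctly classified buildings (diagonal) over all test buildings
accuracy = np.trace(cm) / cm.sum()
print(accuracy)

# Recall per class: diagonal over row totals (actual counts)
recall = np.diag(cm) / cm.sum(axis=1)
# Precision per class: diagonal over column totals (predicted counts)
precision = np.diag(cm) / cm.sum(axis=0)
print(np.round(recall, 3))
print(np.round(precision, 3))
```

Here 148 of the 154 test buildings lie on the diagonal, which reproduces the printed score, and every error falls in a neighboring class.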


SVM

In [71]:
# Create a new SVM model
clf = svm.SVC()

# Dictionary of candidate hyperparameter values
parametres = {'C': [0.1, 1, 10, 50], 'kernel': ['rbf', 'linear']}

# Use grid search (default cross-validation) to test all combinations
grid_clf = model_selection.GridSearchCV(estimator=clf, param_grid=parametres)

# Fit the grid search on the scaled training data
grille = grid_clf.fit(X_train_scaled, y_train)
pd.DataFrame.from_dict(grille.cv_results_).loc[:, ["params", "mean_test_score"]]

# Print the best hyperparameter combination
print(grid_clf.best_params_)

# Mean cross-validated score of the best combination
print("cv score svm ", round(grid_clf.best_score_, 2))  # 98%

{'C': 50, 'kernel': 'linear'}
cv score svm  0.98


In [48]:
# Retrain with the selected hyperparameters
clf = svm.SVC(C=50, kernel='linear')
clf.fit(X_train_scaled, y_train)

# Predict with the retrained model on the scaled test set
y_pred = clf.predict(X_test_scaled)
pd.crosstab(y_test, y_pred, rownames=['Actual class'], colnames=['Predicted class'])


Predicted class,0,1,2,3
Actual class,,,,
0,38,2,0,0
1,1,35,1,0
2,0,0,41,0
3,0,0,0,36


In [72]:
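# Accuracy on the scaled test set; the built-in round() also works when score() returns a plain float
score_svm = clf.score(X_test_scaled, y_test)
print("score svm ", round(score_svm, 2))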

score svm  0.95


RandomForest

In [77]:
# Create a new random forest model
clf2 = ensemble.RandomForestClassifier()

# Dictionary of candidate hyperparameter values
param = {'min_samples_split': np.arange(2, 31, 2), 'max_features': ['sqrt', 'log2', None]}

# Use grid search (default cross-validation) to test all combinations
grid_clf2 = model_selection.GridSearchCV(estimator=clf2, param_grid=param)

# Fit the grid search on the training data
grille2 = grid_clf2.fit(X_train, y_train)
pd.DataFrame.from_dict(grille2.cv_results_).loc[:, ["params", "mean_test_score"]]

# Print the best hyperparameter combination
print(grid_clf2.best_params_)

# Mean cross-validated score of the best combination
print(grid_clf2.best_score_)  # about 99%

{'max_features': 'log2', 'min_samples_split': 16}
0.9951086232173797


In [57]:
# Retrain with the hyperparameters selected by the grid search
clf2 = ensemble.RandomForestClassifier(**grid_clf2.best_params_)
clf2.fit(X_train, y_train)

y_pred = clf2.predict(X_test)
pd.crosstab(y_test, y_pred, rownames=['Actual class'], colnames=['Predicted class'])

Predicted class,0,1,2,3
Actual class,,,,
0,40,0,0,0
1,0,35,2,0
2,0,0,41,0
3,0,0,0,36


In [78]:
# Accuracy on the test set
score_randomforest = clf2.score(X_test, y_test)
print("score random forest ", round(score_randomforest, 2))

score random forest  0.99


We will now build an ensemble method.

Create vc, an instance of the VotingClassifier class that takes the three models chosen previously as parameters and uses hard voting.

Does this model provide better accuracy?

In [70]:

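# Voting classifier with hard (majority) voting
vc = VotingClassifier(estimators=[('knn', knn), ('svm', clf), ('rf', clf2)], voting='hard')
vc = vc.fit(X_train, y_train)
# This score is computed on the training set, so it is optimistic and is not
# directly comparable with the test-set scores of the individual models.
print('Hard vote accuracy =', vc.score(X_train, y_train))

# No: this result does not show that voting improves on the individual models.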

Hard vote accuracy = 1.0
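
For a like-for-like comparison, the ensemble and its members would need to be scored on the same held-out test set. The sketch below illustrates this on synthetic data; the dataset, the hyperparameter values and the use of pipelines to scale inputs for KNN and SVM are assumptions, not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class data (assumption: stand-in for the building dataset)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Placeholder hyperparameters; scaling is bundled with the models that need it
estimators = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))),
    ("svm", make_pipeline(StandardScaler(), SVC(C=1.0, kernel="linear"))),
    ("rf", RandomForestClassifier(random_state=0)),
]

# Hard voting: each model casts one vote and the majority class wins
vc = VotingClassifier(estimators=estimators, voting="hard").fit(X_tr, y_tr)

# Score each fitted member and the ensemble on the same test set
for name, est in vc.named_estimators_.items():
    print(name, round(est.score(X_te, y_te), 3))

train_acc = vc.score(X_tr, y_tr)
test_acc = vc.score(X_te, y_te)
print("hard vote train accuracy", round(train_acc, 3))
print("hard vote test accuracy", round(test_acc, 3))
```

With only three voters, hard voting can improve on the best member only when the other two are right on cases where it is wrong, so the ensemble score is not guaranteed to exceed the best individual score.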
