# ML Application Example
## Regression

The goal of this example is to implement a complete data-driven pipeline (loading, data analysis, visualisation, model selection and optimization, prediction) on a specific dataset. The challenge is to run a regression with several different models and find the one that gives the most accurate prediction.  

<a href="https://marcozamana.azurewebsites.net/"> <img src="https://marcozamana.azurewebsites.net/wp-content/uploads/2018/12/pipeline.png"></a>


## Dataset 
The notebook loads a publicly available dataset: https://archive.ics.uci.edu/ml/datasets/energy+efficiency
<blockquote>
  <b>Source:</b>
    The dataset was created by Angeliki Xifara (angxifara@gmail.com, Civil/Structural Engineer) and was processed by Athanasios    
    Tsanas (tsanasthanasis@gmail.com, Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK).
    <br/>
    <b>Data Set Information:</b>
    The authors of the dataset performed an energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with 
    respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. They simulated various 
    settings as functions of the aforementioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 
    features, aiming to predict two real valued responses. It can also be used as a multi-class classification problem if the response 
    is rounded to the nearest integer.
    <br/>
    <b>Attribute Information:</b>
    The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The 
    aim is to use the eight features to predict each of the two responses.
    <br/>
    <b>Specifically:</b>
    <br/>
    <code><br/> X1 :: Relative Compactness <br/> X2 :: Surface Area <br/> X3 :: Wall Area <br/> X4 :: Roof Area </code>
    <code><br/> X5 :: Overall Height <br/> X6 :: Orientation <br/> X7 :: Glazing Area <br/> X8 :: Glazing Area Distribution </code>
    <code><br/> y1 :: Heating Load <br/> y2 :: Cooling Load </code>
    <br/>
</blockquote>

In [None]:
# algebra
import numpy as np
# data structure
import pandas as pd
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
#file handling
from pathlib import Path


# Data load
The process consists of downloading the data if needed and loading it as a pandas DataFrame.

In [None]:
    
filename  = "ENB2012_data.xlsx"
separator = ';'
columns   = None

#if the dataset is not already in the working directory, download it
my_file = Path(filename)
if not my_file.is_file():
  print("Downloading dataset")
  !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx
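
The `!wget` line above only works inside a notebook on a system where `wget` is installed. As a hedged alternative, the same "download only if missing" step can be sketched in plain Python with the standard library. The helper name `ensure_file` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_file(filename, url):
    """Download `url` to `filename` only if the file is not already present.

    Hypothetical helper: a sketch of the download-if-needed step,
    not the notebook's own implementation.
    """
    path = Path(filename)
    if not path.is_file():
        print("Downloading dataset")
        urlretrieve(url, str(path))
    return path


# Example call (needs network access on the first run):
# ensure_file("ENB2012_data.xlsx",
#             "https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx")
```

Because the function returns the `Path`, it can be passed directly to the loading step.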


In [None]:
#helper to simplify loading the dataset, whether it is a csv, tsv or Excel file
#output is a pandas dataframe
def load_csv(filename,separator,columns):
    try:
        #first attempt: read the file as delimited text (csv, tsv)
        csv_table = pd.read_csv(filename,sep=separator,names=columns)
    except Exception:
        #fallback: the file could not be parsed as text, so try Excel
        csv_table = pd.read_excel(filename,names=columns)
    print("n. samples: {}".format(csv_table.shape[0]))
    print("n. columns: {}".format(csv_table.shape[1]))

    return csv_table #.dropna()

data = load_csv(filename,separator,columns)

# Data Analysis and Visualization
In this section we get familiar with the data: it is plotted, inspected and cleaned.

In [None]:
#What does the dataset look like?
data.head()

In [None]:
#Name of all columns
print(data.columns.values)

In [None]:
#let's have a look at the data and their correlations, if any
sns.pairplot(data)

In [None]:
#Calculate the correlation matrix
corrMatrix = data.corr()

plt.figure(figsize=[13,9])
sns.heatmap(corrMatrix,annot=True)
plt.show()
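
To make the heatmap values easier to read, here is a minimal sketch of what `DataFrame.corr()` returns, using a tiny made-up table (the columns `a`, `b`, `c` are illustrative and unrelated to the dataset). By default pandas computes the Pearson correlation, which is +1 for a perfectly increasing linear relation and -1 for a perfectly decreasing one.

```python
import pandas as pd

# Toy data: b grows linearly with a, c decreases linearly with a
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})
toy["b"] = 2 * toy["a"] + 1
toy["c"] = -toy["a"]

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()
print(corr.round(2))
```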

In [None]:
#Select only the variables of interest for the model and remove rows with missing values (e.g. "nan")
data = data[['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8' ,'Y1', 'Y2']]
data = data.dropna()

# Machine Learning
Here the input features and the output to predict are selected, the data are suitably preprocessed (i.e. standardized to zero mean and unit variance), the dataset is split into separate train and test subsets, and each model is trained on the training data and evaluated against the test set. <br/>
The list of available evaluation metrics can be found <a href='https://scikit-learn.org/stable/modules/model_evaluation.html'>here</a>
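
The metric used throughout this notebook is the mean squared error (MSE). As a small sketch on made-up numbers, the value returned by `mean_squared_error` is simply the average of the squared differences between true and predicted values, so lower is better.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Manual computation: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)
# Same quantity computed by scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both 0.375
```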

In [None]:
#the modules needed for modeling and evaluation are imported
#Train/test split
from sklearn.model_selection import train_test_split
#Data normalization
from sklearn.preprocessing   import StandardScaler
#metrics to evaluate the model
from sklearn.metrics import mean_squared_error

In [None]:
#Selection of the features and of the output variable, and of the size (fraction of the total) of the randomly selected test set
input_features = ['X1', 'X2', 'X3', 'X4', 'X5', 'X7']
output         = ['Y1']
test_size      = 0.33
random_state   = 0

In [None]:
#not preprocessed data
unnormalized_X,unnormalized_y = data[input_features],data[output]

In [None]:
# normalisation
#Having features on a similar scale can help the model converge more quickly towards the minimum
scaler_X = StandardScaler().fit(unnormalized_X)
scaler_y = StandardScaler().fit(unnormalized_y)
X = scaler_X.transform(unnormalized_X)
y = scaler_y.transform(unnormalized_y)
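
A minimal sketch of what `StandardScaler` does, on synthetic values rather than the real data: after `transform` the column has mean 0 and standard deviation 1, and `inverse_transform` maps values back to the original units. Note that `inverse_transform` expects a 2D array, so 1D predictions from a model need a `reshape(-1, 1)` first. The names `toy_y` and `pred_1d` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic target column with an arbitrary mean and spread
rng = np.random.default_rng(0)
toy_y = rng.normal(loc=20.0, scale=10.0, size=(50, 1))

scaler = StandardScaler().fit(toy_y)
scaled = scaler.transform(toy_y)
print(scaled.mean(), scaled.std())  # approximately 0 and 1

# A regressor's predict() usually returns a 1D array;
# here the flattened scaled values stand in for such predictions
pred_1d = scaled.ravel()
restored = scaler.inverse_transform(pred_1d.reshape(-1, 1))
```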

In [None]:
#count the NaN values per feature after normalization, to avoid trouble later
sum(np.isnan(X))

In [None]:
# basic train-test dataset random split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_size,
                                                    random_state=random_state)
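
As a quick sanity check on what `test_size=0.33` means for 768 samples, the sketch below splits placeholder arrays of the same shape (the contents are arbitrary). It also shows that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset
X_toy = np.arange(768 * 6).reshape(768, 6)
y_toy = np.arange(768)

a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=0)
print(a_train.shape, a_test.shape)  # (514, 6) (254, 6)

# Same random_state, same split
again = train_test_split(X_toy, y_toy, test_size=0.33, random_state=0)
```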

In [None]:
#dictionary to help the display of the results
Score_Dict = {}

#helper introduced to simplify the comparison and testing of the various models
#returns the trained model and the score of the selected metric
def fit_predict_plot(model,X_train,y_train,X_test,y_test):
    model.fit(X_train,y_train.ravel())

    pred_normalized_y_test = model.predict(X_test)
    #inverse_transform expects a 2D array, so the 1D predictions are reshaped to a column
    pred_y_test            = scaler_y.inverse_transform(pred_normalized_y_test.reshape(-1,1))
    real_y_test            = scaler_y.inverse_transform(y_test)

    mse_score = mean_squared_error(real_y_test,pred_y_test)
    
    model_name = type(model).__name__
    if(model_name=='GridSearchCV'):
        model_name ='CV_'+type(model.estimator).__name__

    #Alternative metrics are listed here:https://scikit-learn.org/stable/modules/model_evaluation.html
    Score_Dict[model_name]=mse_score

    plt.figure(figsize=[5,5])
    plt.scatter(real_y_test,pred_y_test)
    plt.plot([real_y_test.min(),real_y_test.max()],[real_y_test.min(),real_y_test.max()],'k:')
    plt.axis('equal')
    plt.title("Mean Squared Error: {:.2f}".format(mse_score))
    plt.xlabel('Real Heating Load')
    plt.ylabel('Predicted Heating Load')
    
    return model,mse_score



## Linear models
The linear models used in this example are:
<ul>
    <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">Linear Regression</a></li>
    <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html">Lasso</a></li>
    <li><a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html">Ridge</a></li>
</ul>

In [None]:
#Import the module that gives access to the Linear Regression, Lasso and Ridge algorithms
from sklearn import linear_model

In [None]:
#initialization, fit and evaluation of the model
model = linear_model.LinearRegression()
basic_linear_model, basic_linear_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)

#check the output of the model
print(basic_linear_model.coef_)
print(basic_linear_model.intercept_)



# Ridge
More advanced algorithms require hyper-parameters to be specified; these influence both the convergence and the results of the model.  

In [None]:
# Regularization strength hyper-parameter; must be a positive float
alpha = 10

In [None]:
#initialization, fit and evaluation of the model
model = linear_model.Ridge(alpha=alpha)
ridge_model, ridge_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)

#check the output of the model
print(ridge_model.coef_)
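
To illustrate how the regularization strength influences the result, the sketch below fits `Ridge` on synthetic data (not the energy dataset) with three values of `alpha`. Larger values shrink the coefficients towards zero. The data and the chosen `alpha` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Norm of the fitted coefficient vector for each regularization strength
norms = {}
for alpha in [0.01, 10.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X_toy, y_toy)
    norms[alpha] = np.linalg.norm(model.coef_)
    print(alpha, model.coef_.round(3))
```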

# Lasso


In [None]:
# hyper-parameter: alpha = 0 is equivalent to ordinary least squares, as solved by LinearRegression
alpha = 0.1

In [None]:

model = linear_model.Lasso(alpha=alpha)
lasso_model,lasso_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)

print(lasso_model.coef_)
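
A hedged sketch of what distinguishes Lasso from ordinary least squares: its L1 penalty can set coefficients exactly to zero. The synthetic data below has five features, of which only the first two influence the target; the data and `alpha=0.5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Only the first two of five synthetic features carry signal
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

# OLS gives small non-zero weights to the irrelevant features,
# while Lasso shrinks all weights and zeroes the irrelevant ones
print("OLS  :", ols.coef_.round(3))
print("Lasso:", lasso.coef_.round(3))
```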

# Kernel ridge
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html">Kernel ridge regression (KRR)</a> combines ridge regression with the kernel trick.

In [None]:
from sklearn.kernel_ridge import KernelRidge

In [None]:
# regularization hyper-parameter
alpha  = 0.01
kernel = 'rbf'#'polynomial'

In [None]:
model = KernelRidge(alpha=alpha,kernel = kernel)
krr_model,krr_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)
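
To make "ridge regression with the kernel trick" concrete, the sketch below reproduces the predictions of `KernelRidge` by hand on synthetic data: the dual coefficients solve `(K + alpha * I) c = y`, where `K` is the kernel matrix of the training points, and a prediction is a kernel-weighted sum of those coefficients. The toy data and the `gamma` value are assumptions for illustration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic non-linear problem: y = sin(x)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(40, 1))
y_toy = np.sin(X_toy).ravel()
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)

alpha, gamma = 0.01, 1.0

# Manual kernel ridge: solve (K + alpha * I) c = y for the dual coefficients
K = rbf_kernel(X_toy, X_toy, gamma=gamma)
dual_coef = np.linalg.solve(K + alpha * np.eye(len(X_toy)), y_toy)
# Prediction = kernel between new and training points, times the coefficients
manual_pred = rbf_kernel(X_new, X_toy, gamma=gamma) @ dual_coef

# Same model through scikit-learn
model = KernelRidge(alpha=alpha, kernel="rbf", gamma=gamma).fit(X_toy, y_toy)
sklearn_pred = model.predict(X_new)
print(np.abs(manual_pred - sklearn_pred).max())
```

The features never need to be mapped explicitly into a higher-dimensional space; only kernel evaluations between pairs of points are required.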

# Support Vector Machines
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html">Epsilon-Support Vector Regression</a>

In [None]:
from sklearn.svm import SVR

In [None]:
# hyper-parameter
C = 100
kernel='rbf'

In [None]:
model = SVR(C=C,kernel=kernel,gamma='auto')
svr_model,svr_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)

# Random Forest
A <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">Random Forest Regressor</a> is a meta estimator that fits a number of decision tree regressors on sub-samples of the dataset and averages their predictions.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
model = RandomForestRegressor(n_estimators=100)
random_forest_model,random_forest_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)
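
The "meta estimator" idea can be checked directly: in the sketch below, on synthetic data, the forest's prediction equals the average of the predictions of its individual trees, which are available in the `estimators_` attribute. The data and `n_estimators=20` are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic non-linear regression problem
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(80, 3))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1]

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Predict with every tree separately, then average over the trees
tree_preds = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
mean_of_trees = tree_preds.mean(axis=0)

forest_pred = forest.predict(X_toy[:5])
print(np.abs(mean_of_trees - forest_pred).max())
```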

# Hyper parameters tuning and Cross Validation
Finding the best hyper-parameters without writing hundreds of lines of code is an important efficiency gain.
<br/>Cross-validation (CV) evaluates each candidate on several train/validation splits, which reduces the bias of judging performance on a single split.
<br/>
For the tuning, a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">Grid Search with Cross Validation</a> is used. <br />
<code>cv :: Determines the cross-validation splitting strategy.</code>

In [None]:
from sklearn.model_selection import GridSearchCV
#Five fold splitting strategy
cv = 5
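
A minimal sketch of a grid search on synthetic data, to show what the fitted `GridSearchCV` object exposes. By default the candidates are ranked with the estimator's own `score` method (R² for regressors); the sketch passes `scoring="neg_mean_squared_error"` instead so that the ranking matches the MSE used elsewhere in this notebook. That choice, like the toy data, is an assumption and not what the cells below do.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 4))
y_toy = X_toy @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=120)

# Every alpha in the grid is evaluated with 5-fold cross-validation
grid = {"alpha": np.logspace(-2, 2, 5)}
search = GridSearchCV(Ridge(), grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_toy, y_toy)

print(search.best_params_)   # alpha with the best mean CV score
print(-search.best_score_)   # corresponding mean cross-validated MSE
```

After fitting, `search` behaves like the best estimator refitted on all the data passed to `fit`, so it can be used directly for `predict`.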

## Ridge with GridSearchCV

In [None]:
estimator  = linear_model.Ridge()
parameters = {'alpha':np.logspace(-2,2,10)}
model = GridSearchCV(estimator, parameters,cv=cv)

cv_ridge_model,cv_ridge_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)
print(cv_ridge_model.best_params_)

## Lasso with GridSearchCV

In [None]:
estimator  = linear_model.Lasso()
parameters = {'alpha':np.linspace(0.001,10,10)}
model = GridSearchCV(estimator, parameters,cv=cv)

cv_lasso_model,cv_lasso_linear_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)
#print the tuned hyper-parameters
print(cv_lasso_model.best_params_)



## KernelRidge with GridSearchCV

In [None]:
estimator  = KernelRidge()
parameters = {'kernel':['polynomial','rbf'],
              'degree':np.arange(10),
              'alpha':np.logspace(-2,2,5)}
model = GridSearchCV(estimator, parameters,cv=cv)

cv_krr_model,cv_krr_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)
print(cv_krr_model.best_params_)

## Epsilon-Support Vector Regression with GridSearchCV

In [None]:
estimator  = SVR(gamma='auto')
parameters = {'kernel':['rbf'],
              'C':np.logspace(-2,2,5)}
model = GridSearchCV(estimator, parameters,cv=cv)

cv_svr_model,cv_svr_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)
print(cv_svr_model.best_params_)

## Random Forest Regressor with GridSearchCV

In [None]:
estimator  = RandomForestRegressor()
parameters = {'n_estimators':[10,100,1000]}
model = GridSearchCV(estimator, parameters,cv=cv)

cv_rf_model,cv_rf_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)
print(cv_rf_model.best_params_)

## Results comparison

In [None]:
#print out the results in a table
from IPython.display import Markdown as md
from IPython.display import display


table = '<table><tr><th> Model</th><th> Mean Squared Error (lower is better) </th></tr>'

for key, value in Score_Dict.items():
    table +='<tr> <td>'+key+'</td><td>' +'%.2f'%(value)+'</td></tr>'
table+='</table>'
display(md(table))


names = list(Score_Dict.keys())
values = list(Score_Dict.values())

plt.figure(figsize=(15, 3))
plt.bar(names, values)
plt.ylabel('Mean Squared Error')
plt.xticks(rotation=30)
plt.grid()
