# ML Application Exercise
## Regression

The task of this exercise is to implement a complete data-driven pipeline (loading, data analysis, visualisation, model selection and optimisation, prediction) on a specific dataset. The challenge is to perform a regression with different models and find the one that gives the most accurate prediction.


## Dataset 
The notebook loads a publicly available dataset: https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise
<blockquote>
  <b>Source:</b>
    Donor: Dr Roberto Lopez robertolopez@intelnics.com Intelnics
    <br/>
    Creators: Thomas F. Brooks, D. Stuart Pope and Michael A. Marcolini NASA
    <br/>
    <b>Data Set Information:</b>
    NASA data set, obtained from a series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections 
    conducted in an anechoic wind tunnel.
    The NASA data set comprises different size NACA 0012 airfoils at various wind tunnel speeds and angles of attack. The span of the 
    airfoil and the observer position were the same in all of the experiments. 
    <br/>
    <b>Attribute Information:</b>
    This problem has the following inputs: 
    <code><br/> 1. Frequency, in Hertz.                           </code>
    <code><br/> 2. Angle of attack, in degrees.                   </code>
    <code><br/> 3. Chord length, in meters.                       </code>
    <code><br/> 4. Free-stream velocity, in meters per second.    </code>
    <code><br/> 5. Suction side displacement thickness, in meters.</code>
    <br/>
    The only output is:
    <code><br/>6. Scaled sound pressure level, in decibels.        </code>
</blockquote>

In [None]:
# algebra
import numpy as np
# data structure
import pandas as pd
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
#file handling
from pathlib import Path


# Data load
The process consists of downloading the data if needed and loading it as a pandas DataFrame.

In [None]:
    
filename  = "airfoil_self_noise.dat"
separator = '\t'
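#the file has no header row, so the column names are set explicitly
columns   = ['Frequency','Angle of attack','Chord length','Free-stream velocity',
             'Suction side displacement thickness','Scaled sound pressure level']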

#download the dataset if it is not already in the working directory
my_file = Path(filename)
if not my_file.is_file():
  print("Downloading dataset")
  !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat


In [None]:
#function to simplify loading the dataset, whether it is a csv, tsv or excel file
#output is a pandas dataframe 
def load_csv(filename,separator,columns):
    
    try:
        csv_table = pd.read_csv(filename,sep=separator,names=columns)
    except Exception:
        #fall back to the excel reader if the file cannot be parsed as text
        csv_table = pd.read_excel(filename,names=columns)
    print("n. samples: {}".format(csv_table.shape[0]))
    print("n. columns: {}".format(csv_table.shape[1]))

    return csv_table

data = load_csv(filename,separator,columns)
#remove rows that contain missing values (e.g. NaN)
data = data.dropna()

# Data Analysis and Visualization
## Task:
* Check the first entries of the dataframe
* Print the names of all columns
* Plot pairwise relationships of the dataset (<a href="https://seaborn.pydata.org/generated/seaborn.pairplot.html"> hint</a>)
* Plot the correlation matrix (<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html"> Dataframe Correlation</a>, <a href="https://seaborn.pydata.org/generated/seaborn.heatmap.html">heatmap</a>)
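
The non-plotting steps can be sketched on a small synthetic frame. This is only an illustration: the columns `a`, `b`, `c` and the random data are assumptions standing in for the airfoil dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical columns)
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=100),  # strongly tied to "a"
    "c": rng.normal(size=100),                       # independent noise
})

first_rows = toy.head()           # first five entries
column_names = list(toy.columns)  # names of all columns
corr = toy.corr()                 # pairwise Pearson correlation matrix

print(first_rows)
print(column_names)
print(corr.round(2))
```

The resulting `corr` frame can be passed to `seaborn.heatmap`, and the frame itself to `seaborn.pairplot`, as suggested by the hints above.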

In [None]:
#check the first entries of the dataframe


In [None]:
#print the names of all columns


In [None]:
#Plot pairwise relationships of the dataset


In [None]:
#Calculate and plot the correlation matrix


# Machine Learning
Here the relevant input features and the output to predict are selected, the data are suitably preprocessed (i.e. standardized), the dataset is split into separate train and test subsets, and each model is trained on the training data and evaluated against the test set.<br/>
The list of available evaluation metrics can be found <a href='https://scikit-learn.org/stable/modules/model_evaluation.html'>here</a>.

## Task
* Import a metric to evaluate the models (e.g. median_absolute_error, or choose from this <a href='https://scikit-learn.org/stable/modules/model_evaluation.html'>list</a>)
* Define a Scaler to standardize the input features and the output, and apply the transformation so that the models fit better
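
A minimal sketch of standardizing and then mapping back to the original units, assuming synthetic arrays `raw_X` and `raw_y` (hypothetical stand-ins for the unnormalized data). The name `scaler_y` matches what `fit_predict_plot` expects later in the notebook; `scaler_X` is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import median_absolute_error
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the unnormalized inputs and output (assumption)
rng = np.random.default_rng(0)
raw_X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
raw_y = (raw_X @ np.array([1.0, -2.0, 0.5]) + 120.0).reshape(-1, 1)

# One scaler per array, so the output can be mapped back independently
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X = scaler_X.fit_transform(raw_X)
y = scaler_y.fit_transform(raw_y)

# inverse_transform expects a 2D array, hence the reshape of 1D predictions
fake_pred = y.ravel()  # pretend the model predicted the targets perfectly
restored = scaler_y.inverse_transform(fake_pred.reshape(-1, 1))

error = median_absolute_error(raw_y.ravel(), restored.ravel())
print(X.mean(axis=0).round(6), X.std(axis=0).round(6), error)
```

Each standardized column has zero mean and unit standard deviation, and the inverse transform recovers the original units, which is why the error is computed on the restored values.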

In [None]:
#the modules needed for the modelling and data mining are imported
#Train-test split
from sklearn.model_selection import train_test_split
#Data normalization
from sklearn.preprocessing   import StandardScaler


In [None]:
#import a metric to evaluate the model
#from sklearn... 



In [None]:
#Selection of the features and the output variable, and definition of the size (fraction of the total) of the randomly selected test set
input_features = ['Frequency','Angle of attack','Chord length','Free-stream velocity','Suction side displacement thickness']
output         = ['Scaled sound pressure level']
test_size      = 0.33
random_state   = 0

#not preprocessed data
unnormalized_X,unnormalized_y = data[input_features],data[output]

In [None]:
#Standardisation
#Define a Scaler to standardize the input features and output


#standardize the inputs and outputs



In [None]:
# basic train-test dataset random split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=test_size,
                                                    random_state=random_state)

In [None]:
#dictionary to help the display of the results
Score_Dict = {}

#helper that simplifies the comparison and testing of the various models:
#it returns the trained model and the score of the selected metric
def fit_predict_plot(model,X_train,y_train,X_test,y_test):
    model.fit(X_train,y_train.ravel())

    pred_normalized_y_test = model.predict(X_test)
    #inverse_transform expects a 2D array, so the 1D predictions are reshaped
    pred_y_test            = scaler_y.inverse_transform(pred_normalized_y_test.reshape(-1,1))
    real_y_test            = scaler_y.inverse_transform(y_test)

    #Alternative metrics are listed here: https://scikit-learn.org/stable/modules/model_evaluation.html
    score = median_absolute_error(real_y_test,pred_y_test)
    
    model_name = type(model).__name__
    if(model_name=='GridSearchCV'):
        model_name ='CV_'+type(model.estimator).__name__
    
    Score_Dict[model_name]=score

    plt.figure(figsize=[5,5])
    plt.scatter(real_y_test,pred_y_test)
    plt.plot([real_y_test.min(),real_y_test.max()],[real_y_test.min(),real_y_test.max()],'k:')
    plt.axis('equal')
    plt.title("Median Absolute Error: {:.2f}".format(score))
    plt.xlabel('True Scaled sound pressure level')
    plt.ylabel('Predicted Scaled sound pressure level')
    
    return model,score



# Model implementation 
As an example, a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html">linear regression</a> is implemented and its results are shown. 

## Tasks
* Implement a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html">Lasso Regression</a> 
    * hyper-parameter alpha = 0.1
    * Remember what we have seen regarding Lasso and regularisation (hint: check the trained coefficients)
* Implement an <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html">Epsilon-Support Vector Regression</a> with the following hyper-parameters:
    * C = 100
    * kernel='rbf'
    * gamma = 'auto'
* Implement a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">Random Forest Regressor</a> with the following hyper-parameter:
    * n_estimators=100
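
The hint about the Lasso coefficients can be illustrated on synthetic data. This is only a sketch under stated assumptions: `X_toy` and `y_toy` are made up, and only the first two of the five toy features influence the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only the first two of five features drive the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
lasso = Lasso(alpha=0.1).fit(X_toy, y_toy)

# The L1 penalty shrinks the coefficients and sets the irrelevant ones to zero
print("OLS  :", ols.coef_.round(3))
print("Lasso:", lasso.coef_.round(3))
```

Comparing the two printed vectors shows the effect of the regularisation: the informative coefficients are pulled towards zero, and the uninformative ones vanish.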



In [None]:
#Import the module that gives access to the Linear Regression, Lasso and Ridge algorithms
from sklearn import linear_model

In [None]:
#initialization, fit and evaluation of the model
model = linear_model.LinearRegression()
basic_linear_model, basic_linear_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)

#check the output of the model
print(basic_linear_model.coef_)
print(basic_linear_model.intercept_)



# Lasso


In [None]:
#Hyperparameter definition


#initialization, fit and evaluation of the model


#Check the trained parameters (coefficients)




# Support Vector Machines
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html">Epsilon-Support Vector Regression</a>

In [None]:
from sklearn.svm import SVR

In [None]:
#Import the correct module from scikit-learn


# hyper-parameter definition


#initialization, fit and evaluation of the model




# Random Forest
A <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html">Random Forest Regressor</a> is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and averages their predictions.
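
The averaging can be checked by hand on a small synthetic problem. This is a sketch, not part of the exercise: `X_toy`, `y_toy` and the choice of 20 trees are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem (assumption, not the airfoil data)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1.0, 1.0, size=(150, 2))
y_toy = np.sin(3.0 * X_toy[:, 0]) + X_toy[:, 1] ** 2

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Average the predictions of the individual fitted trees by hand
tree_mean = np.mean([tree.predict(X_toy) for tree in forest.estimators_], axis=0)
forest_pred = forest.predict(X_toy)

print(np.allclose(tree_mean, forest_pred))
```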

In [None]:
#Import the correct module from scikit-learn


# hyper-parameter definition


#initialization, fit and evaluation of the model




# Hyper parameters tuning and Cross Validation
Searching for the best hyper-parameters automatically, without writing a dedicated loop for each model, saves a considerable amount of code.
<br/>Cross-validation (CV) reduces the bias in the performance evaluation that a single train/validation split would introduce.
<br/>
For the tuning, a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">Grid Search with Cross Validation</a> is used. <br />
<code>cv :: Determines the cross-validation splitting strategy.</code>
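
One possible splitting strategy is `KFold` from scikit-learn. The sketch below assumes ten toy samples and shuffling with a fixed seed; both are illustrative choices rather than requirements of the exercise.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # ten toy samples (assumption)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

test_indices = []
for train_idx, test_idx in cv.split(X_toy):
    # Each fold holds out 2 samples and trains on the remaining 8
    print("train:", train_idx, "test:", test_idx)
    test_indices.extend(test_idx.tolist())
```

Every sample ends up in the held-out part of exactly one fold, so each one contributes once to the validation score.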


As an example, a grid search with cross-validation is implemented using:
* estimator = KernelRidge
* hyper-parameters : 
    * kernel: polynomial, rbf
    * degree: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] = np.arange(10)
    * alpha : [0.01, 0.1, 1, 10, 100]        = np.logspace(-2,2,5) 


## Tasks:

* Import the module that automates the search over specified (hyper-)parameter values for an estimator
* Implement a 5-fold splitting strategy for the Cross Validation
* Implement the following Gridsearch models:
    * Lasso:
        * alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
        * which alpha is the best? 
    * Support Vector:
        * kernel = poly, rbf
        * C = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
    * Random Forest:
        * n_estimators = [10, 120, 230, 340, 450, 560, 670, 780, 890, 1000] (integers, as required by scikit-learn)
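
The pattern behind these tasks can be sketched as follows for the Lasso grid, run on synthetic data. `X_toy`, `y_toy` and the shuffled `KFold` are assumptions made for illustration; the best alpha found here says nothing about the airfoil data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data: only the first two of five features drive the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Five-fold splitting strategy and the grid of candidate alphas
cv = KFold(n_splits=5, shuffle=True, random_state=0)
parameters = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Every alpha is scored on the five folds, then the best one is refit on all data
search = GridSearchCV(Lasso(), parameters, cv=cv)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(type(search.estimator).__name__)
```

After fitting, `search` behaves like the refitted best model, so it can be handed to a helper such as `fit_predict_plot` in place of a plain estimator.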

In [None]:
#import the module from scikit-learn that performs the grid search


#set up the five-fold splitting strategy for cross-validation



## KernelRidge with GridSearchCV

In [None]:
#import the estimator used in this example
from sklearn.kernel_ridge import KernelRidge

#define estimator and hyper-parameters range 
estimator  = KernelRidge()
parameters = {'kernel':['polynomial','rbf'],
              'degree':np.arange(10),
              'alpha':np.logspace(-2,2,5)}

#initialize the grid search; after fitting it exposes the best model 
model = GridSearchCV(estimator, parameters,cv=cv)

#train and evaluate the best model, plot the results 
cv_krr_model,cv_krr_score = fit_predict_plot(model,X_train,y_train,X_test,y_test)
#print best hyper-parameters
print(cv_krr_model.best_params_)

## Lasso with GridSearchCV

In [None]:
#define estimator and hyper-parameters range 



#initialize the gridsearch and extract the best model 


#train and evaluate the best model, plot the results



#print the best hyper-parameters



## Epsilon-Support Vector Regression with GridSearchCV

In [None]:
#define estimator and hyper-parameters range 



#initialize the gridsearch and extract the best model 


#train and evaluate the best model, plot the results



#print the best hyper-parameters



## Random Forest Regressor with GridSearchCV

In [None]:
#define estimator and hyper-parameters range 



#initialize the gridsearch and extract the best model 


#train and evaluate the best model, plot the results



#print the best hyper-parameters



## Results comparison

In [None]:
#print out the results in a table
from IPython.display import Markdown as md
from IPython.display import display


table = '<table><tr><th> Model</th><th> Median Absolute Error </th></tr>'

for key, value in Score_Dict.items():
    table +='<tr> <td>'+key+'</td><td>' +'%.2f'%(value)+'</td></tr>'
table+='</table>'
display(md(table))


names = list(Score_Dict.keys())
values = list(Score_Dict.values())

plt.figure(figsize=(15, 3))
plt.bar(names, values)
plt.ylabel('Median Absolute Error')
plt.xticks(rotation=30)
plt.grid()
