In [18]:
import pandas as pd
import numpy as np
import scipy
from sklearn.model_selection import train_test_split

from sklearn.ensemble import AdaBoostRegressor

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

Loading the dataset

The dataset is in .dat format.

In [19]:
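# Raw string avoids an invalid-escape warning; if the file has no header row,
# also pass header=None so the first record is not consumed as column names.
dataset = pd.read_table("airfoil_self_noise.dat", sep=r"\s+")
dataset.columns = ['Freq', 'Angle', 'Chord', 'Velocity', 'Thickness', 'Sound']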
dataset.head(10)

   Freq  Angle   Chord  Velocity  Thickness    Sound
0  1000    0.0  0.3048      71.3   0.002663  125.201
1  1250    0.0  0.3048      71.3   0.002663  125.951
2  1600    0.0  0.3048      71.3   0.002663  127.591
3  2000    0.0  0.3048      71.3   0.002663  127.461
4  2500    0.0  0.3048      71.3   0.002663  125.571
5  3150    0.0  0.3048      71.3   0.002663  125.201
6  4000    0.0  0.3048      71.3   0.002663  123.061
7  5000    0.0  0.3048      71.3   0.002663  121.301
8  6300    0.0  0.3048      71.3   0.002663  119.541
9  8000    0.0  0.3048      71.3   0.002663  117.151


Train-test split

The features and the target could also be separated first and passed to train_test_split together, but I prefer to split the whole DataFrame and separate them afterwards.

In [20]:
train, test = train_test_split(dataset, test_size=0.2, random_state=8)
test = test.reset_index(drop=True)  # reset_index returns a new DataFrame, so assign it back
xtrain = train.iloc[:, 0:-1]
ytrain = train.iloc[:,-1]
xtest = test.iloc[:, 0:-1]
ytest = test.iloc[:,-1]

Model without hyperparameter tuning

In [21]:
ada_regressor = AdaBoostRegressor()
ada_regressor.fit(xtrain, ytrain)
ypred_ada = ada_regressor.predict(xtest)

Model with hyperparameter tuning

In [22]:
ada_regressor_ht = AdaBoostRegressor()
ada_parameters = {'n_estimators': [50,80,120,150,200],
                  'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.3, 0.5, 1],
                  'loss': ['linear', 'square', 'exponential']}

gridsearch_ada = GridSearchCV(ada_regressor_ht, param_grid=ada_parameters)

ada_regression_model = gridsearch_ada.fit(xtrain,ytrain)
ypred_ada_regression_model = gridsearch_ada.predict(xtest)

ada_regression_model.best_params_

{'learning_rate': 1, 'loss': 'square', 'n_estimators': 80}
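
To get a feel for how much work the grid search above does, the candidate combinations can be enumerated by hand. This is only a sketch using the standard library: the grid values are copied from the cell above, while `n_folds = 5` is an assumption based on the default `cv` of GridSearchCV in recent scikit-learn versions.

```python
from itertools import product

# Same grid as in the cell above
ada_parameters = {'n_estimators': [50, 80, 120, 150, 200],
                  'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.3, 0.5, 1],
                  'loss': ['linear', 'square', 'exponential']}

# Every combination of one value per hyperparameter is one candidate model
names = list(ada_parameters)
candidates = [dict(zip(names, values))
              for values in product(*ada_parameters.values())]

n_folds = 5  # assumption: default cv of GridSearchCV in recent scikit-learn
print("candidates:", len(candidates))
print("model fits during the search:", len(candidates) * n_folds)
```

Each candidate is fitted once per fold, which is why even a modest grid takes a noticeable time to run.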

Evaluating model performance

Root mean squared error (RMSE) is mostly informative when compared with the RMSE of other models on the same data: a smaller RMSE means the residuals are typically smaller, so the model fits better. It is expressed in the same units as the target variable.

R-squared, in contrast, is the proportion of the variance of the dependent variable that is explained by the independent variables. Plain R-squared does not penalize overfitting (adjusted R-squared is the variant that accounts for the number of independent variables). Here it is computed on the held-out test set with score(), so an overfitted model would tend to show a lower value.
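
Both metrics are simple enough to compute by hand, which may make the definitions above more concrete. The following is a minimal sketch in plain Python; the helper names `rmse` and `r_squared` and the toy values are hypothetical and are not taken from the airfoil data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical, for illustration only)
y_true = [120.0, 125.0, 130.0, 135.0]
y_pred = [122.0, 124.0, 129.0, 137.0]

print("RMSE:", rmse(y_true, y_pred))
print("R-squared:", r_squared(y_true, y_pred))
```

These should agree with np.sqrt(mean_squared_error(...)) and the regressor's score() when given the same true and predicted values.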

In [23]:
# First model

RMSE_ada = np.sqrt(mean_squared_error(ytest, ypred_ada))
rsq1 = ada_regressor.score(xtest, ytest)
print("1st model RMSE:", RMSE_ada)
print("1st model R-square:", rsq1)

# Second model

RMSE_ada_2 = np.sqrt(mean_squared_error(ytest, ypred_ada_regression_model))
rsq2 = ada_regression_model.score(xtest, ytest)
print("2nd model RMSE:", RMSE_ada_2)
print("2nd model R-square:", rsq2)

1st model RMSE: 3.7886794453967996
1st model R-square: 0.693462820012728
2nd model RMSE: 3.730275958125457
2nd model R-square: 0.702840679531579


Cross Validation

With cross-validation we test the performance of the model by splitting the data into 5 folds (the number of folds can be chosen freely) and evaluating 5 times: each time a different fold is the test set while all the other folds form the training set.

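I only run cross-validation for the second model. Since ada_regression_model is the fitted GridSearchCV object, cross_val_score refits the whole grid search inside each fold, and the reported scores are R-squared values.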
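
The fold mechanics described above can be sketched without any library. The helper `k_fold_indices` below is hypothetical and only illustrates the idea of contiguous, unshuffled folds; it is not scikit-learn's implementation, and the sample count of 12 is a toy value.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Hypothetical toy size; the real training set is much larger
folds = list(k_fold_indices(n_samples=12, n_folds=5))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: test={test_idx}, n_train={len(train_idx)}")
```

Every sample ends up in a test fold exactly once, so the five scores together cover the whole training set.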

In [24]:
scores = cross_val_score(ada_regression_model, xtrain, ytrain, cv=5)
print("CV Scores: ", scores)
print("Mean CV Score: ", scores.mean())

CV Scores:  [0.72973674 0.76485816 0.71478381 0.75103734 0.69913738]
Mean CV Score:  0.7319106864632727


Conclusion: the AdaBoost regressor can give a somewhat accurate prediction of the noise level from the independent variables (test R-squared of about 0.70 after tuning), but under similar circumstances the Random Forest regressor provided more accurate predictions.