In [1]:
import pandas as pd
import numpy as np
import scipy
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

Loading the dataset.

The dataset is in .dat format.

In [2]:
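# Without header=None, the first line of the file is read as the header row
# and then overwritten by the column names assigned below.
dataset = pd.read_table("airfoil_self_noise.dat", sep=r"\s+")
dataset.columns = ['Frequency', 'Angle', 'Chord', 'Velocity', 'Thickness', 'Sound']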
print(dataset)

      Frequency  Angle   Chord  Velocity  Thickness    Sound
0          1000    0.0  0.3048      71.3   0.002663  125.201
1          1250    0.0  0.3048      71.3   0.002663  125.951
2          1600    0.0  0.3048      71.3   0.002663  127.591
3          2000    0.0  0.3048      71.3   0.002663  127.461
4          2500    0.0  0.3048      71.3   0.002663  125.571
...         ...    ...     ...       ...        ...      ...
1497       2500   15.6  0.1016      39.6   0.052849  110.264
1498       3150   15.6  0.1016      39.6   0.052849  109.254
1499       4000   15.6  0.1016      39.6   0.052849  106.604
1500       5000   15.6  0.1016      39.6   0.052849  106.224
1501       6300   15.6  0.1016      39.6   0.052849  104.204

[1502 rows x 6 columns]


Train test split

The features and target could also be passed to train_test_split separately so that it returns all four pieces at once, but I prefer to split the whole DataFrame first and separate features from target afterwards.

In [3]:
train, test = train_test_split(dataset, test_size=0.2, random_state=8)
test = test.reset_index(drop=True)  # reset_index returns a new DataFrame, so keep the result
xtrain = train.iloc[:, 0:-1]
ytrain = train.iloc[:,-1]
xtest = test.iloc[:, 0:-1]
ytest = test.iloc[:,-1]
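
As a rough illustration of what a shuffled 80/20 hold-out split does with the row indices, here is a minimal sketch in plain Python. The helper name holdout_split is hypothetical, and the shuffle is not meant to reproduce the exact rows that train_test_split selects for the same seed.

```python
import math
import random

def holdout_split(n_rows, test_size=0.2, seed=8):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # seeded, so the split is reproducible
    n_test = math.ceil(n_rows * test_size)    # this sketch rounds the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(10)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

Every row ends up in exactly one of the two sets, which is what keeps the test score an honest estimate.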

Model without hyperparameter tuning

In [4]:
rf_regressor = RandomForestRegressor(random_state=34)
rf_regressor.fit(xtrain,ytrain)
ypred_rf = rf_regressor.predict(xtest)

Model with hyperparameter tuning

In [5]:
rf_regressor_ht = RandomForestRegressor(random_state=34)
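# 'auto' is no longer accepted by newer scikit-learn releases; for regressors
# it meant "use all features", which None also selects.
parameters = {'n_estimators': [50, 100, 200, 300],
              'max_features': [2, 4, 5, None],
              'max_depth': [3, 5, 8, None]
              }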

gridsearch = GridSearchCV(rf_regressor_ht, param_grid=parameters)

RF_model = gridsearch.fit(xtrain,ytrain)
ypred_rf_ht = gridsearch.predict(xtest)

RF_model.best_params_

{'max_depth': None, 'max_features': 4, 'n_estimators': 300}
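
To make the cost of an exhaustive search concrete, here is a minimal sketch of how a parameter grid expands into candidates. The grid is a deliberately smaller, made-up one, expand_grid is a hypothetical helper, and the fold count of 5 is an assumption for the illustration.

```python
from itertools import product

# Illustrative grid (hypothetical, smaller than the one used in the notebook).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

def expand_grid(grid):
    """Return every parameter combination in the grid as a dict."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(param_grid)
n_folds = 5  # assumed number of cross-validation folds per candidate
print(len(candidates), "candidates,", len(candidates) * n_folds, "model fits")
for params in candidates:
    print(params)
```

The number of fits is the product of the list lengths times the number of folds, so each additional value in any list multiplies the running time.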

Evaluating model performance

Root mean squared error (RMSE) is expressed in the units of the target and is most informative when compared with the RMSE of other models on the same data. A smaller RMSE means that the residuals are smaller on average, so the model is a better fit.

R-squared, in contrast, shows what proportion of the variance in the dependent variable is explained by the independent variables. Because it is computed on the held-out test set here, it does not automatically improve when more independent variables are added, so an overfitted model shows up as a lower score.
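
Both metrics can be computed by hand, which helps to see how they relate. The following is a minimal sketch in plain Python; the function names and the four target values are made up for illustration and are not taken from the airfoil data.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [120.0, 125.0, 130.0, 135.0]  # made-up targets
y_pred = [121.0, 124.0, 131.0, 134.0]  # made-up predictions, each off by 1

print("RMSE:", rmse(y_true, y_pred))
print("R-squared:", r_squared(y_true, y_pred))
```

Both metrics are built from the same squared residuals, so on a fixed test set a model with a lower RMSE always has a higher R-squared.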

In [6]:
# first model

rmse_rf_1 = np.sqrt(mean_squared_error(ytest, ypred_rf))
rsq1 = rf_regressor.score(xtest, ytest)
print("1st model RMSE:", rmse_rf_1)
print("1st model R-square:", rsq1)

# second model
# On a fixed test set MSE = (1 - R^2) * Var(ytest), so the higher R-square
# of the tuned model goes together with a lower RMSE.

rmse_rf_2 = np.sqrt(mean_squared_error(ytest, ypred_rf_ht))
rsq2 = RF_model.score(xtest, ytest)
print("2nd model RMSE:", rmse_rf_2)
print("2nd model R-square:", rsq2)

1st model RMSE: 1.6642913569028208
1st model R-square: 0.9408484472228751
2nd model RMSE: 1.5889
2nd model R-square: 0.9460831447560455


Cross validation

Cross validation tests the model by splitting the training data into 5 folds (the number of folds can be chosen freely) and evaluating it 5 times: each time a different fold serves as the test set while the remaining folds form the training set. The reported score is R-squared, the regressor's default.
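
The fold rotation described above can be sketched in a few lines of plain Python. The helper k_fold_indices is hypothetical and works on 10 made-up row indices without shuffling; it only illustrates the idea and is not the library's implementation.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_rows))
    # Spread the rows as evenly as possible: the first n_rows % k folds get one extra row.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

Each row is used for testing once and for training in the other four rounds, so the mean of the five scores uses all of the data.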

I only do cross validation for the second model.

In [7]:
rf_model = RandomForestRegressor(
    n_estimators=gridsearch.best_estimator_.n_estimators,
    max_features=gridsearch.best_estimator_.max_features,
    max_depth=gridsearch.best_estimator_.max_depth,
)

scores = cross_val_score(rf_model, xtrain, ytrain, cv=5)
print("CV Scores: ", scores)
print("Mean CV Score: ", scores.mean())

CV Scores:  [0.91230905 0.93212618 0.9268391  0.91968995 0.91286847]
Mean CV Score:  0.9207665491599718


Conclusion: the Random Forest regressor predicts the noise level quite accurately from the independent variables, with an R-squared of about 0.94 on the test set and about 0.92 in cross validation.