<a href="https://colab.research.google.com/github/Aditya-Singla/Airfoil-Self-Noise/blob/master/Airfoil_Self_Noise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Importing Libraries**

In [0]:
import pandas as pd

**Importing the Dataset**

In [0]:
dataset = pd.read_excel('airfoil_self_noise.xlsx')

X = dataset.iloc[:,0:-1].values
y = dataset.iloc[:,-1].values

Checking the **independent variable**

In [3]:
print(X)

[[1.00000e+03 0.00000e+00 3.04800e-01 7.13000e+01 2.66337e-03]
 [1.25000e+03 0.00000e+00 3.04800e-01 7.13000e+01 2.66337e-03]
 [1.60000e+03 0.00000e+00 3.04800e-01 7.13000e+01 2.66337e-03]
 ...
 [4.00000e+03 1.56000e+01 1.01600e-01 3.96000e+01 5.28487e-02]
 [5.00000e+03 1.56000e+01 1.01600e-01 3.96000e+01 5.28487e-02]
 [6.30000e+03 1.56000e+01 1.01600e-01 3.96000e+01 5.28487e-02]]


Checking the **dependent variable**



In [4]:
print(y)

[125.201 125.951 127.591 ... 106.604 106.224 104.204]


**No Missing Values** (*as specified by the source* https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise#)

**Splitting the dataset into training set and test set**

In [0]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)



Using **Linear Regression** 

In [6]:
from sklearn.linear_model import LinearRegression
regressor_lr = LinearRegression()
regressor_lr.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Predicting using **Linear Regression**

In [0]:
y_predict_lr = regressor_lr.predict(x_test)

**Evaluation of Linear Regression using R^2**

In [8]:
from sklearn.metrics import r2_score
r2_lr = r2_score(y_test,y_predict_lr)
print(r2_lr)

0.48657611175423365


Evaluating using **K-fold Cross Validation**

In [9]:
from sklearn.model_selection import cross_val_score
accuracy_lr = cross_val_score(regressor_lr, x_train, y_train, cv=10)
print ('Accuracy: {:.2f}%'.format(accuracy_lr.mean()*100))
print ('Standard Deviation: {:.2f}%'.format(accuracy_lr.std()*100))

Accuracy: 50.67%
Standard Deviation: 8.18%


Using **Decision Tree Regression**

In [10]:
from sklearn.tree import DecisionTreeRegressor
regressor_dt = DecisionTreeRegressor(random_state = 0)
regressor_dt.fit(x_train,y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=0, splitter='best')

Predicting using **Decision Tree Regression**

In [0]:
y_predict_dtr = regressor_dt.predict(x_test)

**Evaluation of Decision Tree Regression using R^2**

In [12]:
r2_dtr = r2_score(y_test,y_predict_dtr)
print(r2_dtr)

0.8615590053887587


Evaluating using **K-fold Cross Validation**

In [13]:
accuracy_dt = cross_val_score(regressor_dt, x_train, y_train, cv=10)
print ('Accuracy: {:.2f}%'.format(accuracy_dt.mean()*100))
print ('Standard Deviation: {:.2f}%'.format(accuracy_dt.std()*100))

Accuracy: 84.80%
Standard Deviation: 3.69%


Using **Random Forest Regression**

In [14]:
from sklearn.ensemble import RandomForestRegressor
regressor_rf = RandomForestRegressor(n_estimators = 100, random_state = 0)
regressor_rf.fit(x_train,y_train)


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=0, verbose=0, warm_start=False)

Predicting using **Random Forest Regression**

In [0]:
y_predict_rfr = regressor_rf.predict(x_test)

**Evaluation of Random Forest Regression using R^2**

In [16]:
r2_rfr = r2_score(y_test,y_predict_rfr)
print(r2_rfr)

0.9314870737195178


Evaluating using **K-fold Cross Validation**

In [17]:
accuracy_rf = cross_val_score(regressor_rf, x_train, y_train, cv=10)
print ('Accuracy: {:.2f}%'.format(accuracy_rf.mean()*100))
print ('Standard Deviation: {:.2f}%'.format(accuracy_rf.std()*100))

Accuracy: 92.54%
Standard Deviation: 1.86%


Applying **Feature Scaling** (for **Support Vector Regression**)

In [0]:
from sklearn.preprocessing import StandardScaler
x_sc = StandardScaler()
y_sc = StandardScaler()

y_train = y_train.reshape(len(y_train),-1) # 2-D array for feature scaling of dependent variable

x_train_sc = x_sc.fit_transform(x_train)
y_train_sc = y_sc.fit_transform(y_train)

Using **Support Vector Regression** (since *linear regression did not yield a good result and other non-linear models performed better, we would use kernel svm*)

In [20]:
from sklearn.svm import SVR
regressor_svm = SVR(kernel = 'rbf')
regressor_svm.fit(x_train_sc,y_train_sc)

  y = column_or_1d(y, warn=True)


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

Predict using **Support Vector Regression**

In [0]:
y_predict_svr = y_sc.inverse_transform(regressor_svm.predict(x_sc.fit_transform(x_test)))

**Evaluation of Support Vector Regression using R^2**

In [22]:
r2_svr = r2_score(y_test,y_predict_svr)
print(r2_svr)

0.7596576978813243


Evaluating using **K-fold Cross Validation**

In [23]:
accuracy_svm = cross_val_score(regressor_svm, x_train_sc, y_train_sc, cv=10)
print ('Accuracy: {:.2f}%'.format(accuracy_svm.mean()*100))
print ('Standard Deviation: {:.2f}%'.format(accuracy_svm.std()*100))

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


Accuracy: 77.15%
Standard Deviation: 3.75%


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


Using **XGBoost Regressor** (using a tree-based approach as decision trees and random forests gave goood results) 

In [37]:
from xgboost import XGBRegressor
regressor_xgb = XGBRegressor(booster='gbtree')
regressor_xgb.fit(x_train_sc,y_train_sc)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

Predicting using **XG Boost** 

In [0]:
y_predict_xgb = y_sc.inverse_transform(regressor_xgb.predict(x_sc.fit_transform(x_test)))

**Evaluation of XGBoost using R^2**

In [34]:
r2_xgb = r2_score(y_test,y_predict_xgb)
print(r2_xgb)

0.7858662790592736


Evaluating using **K-fold Cross Validation**

In [35]:
accuracy_xgb = cross_val_score(regressor_xgb, x_train_sc, y_train_sc, cv=10)
print ('Accuracy: {:.2f}%'.format(accuracy_xgb.mean()*100))
print ('Standard Deviation: {:.2f}%'.format(accuracy_xgb.std()*100))

Accuracy: 84.85%
Standard Deviation: 2.65%


Based on the R-square test and K-fold cross validation score, we conclude that **Random Forest Regression** is the *best fitting model* and we could now tune it further.

**Tuning Random Forest Model**

In [38]:
from sklearn.model_selection import GridSearchCV
parameters = [{'n_estimators':[50,75,100,125,150,175,200,225,250,275,300,325,350,375,400],'max_features':['auto','sqrt','log2'] ,'random_state': [0]}]

grid_search = GridSearchCV(estimator = regressor_rf, param_grid = parameters, scoring = 'r2', cv = 10, n_jobs =-1)

grid_search.fit(x_train, y_train)

best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)

  self.best_estimator_.fit(X, y, **fit_params)


Best Accuracy: 92.84 %
Best Parameters: {'max_features': 'auto', 'n_estimators': 300, 'random_state': 0}


Let me a bit more greedy with accuracy!

In [40]:
parameters_new = [{'n_estimators':[280,285,290,295,300,305,310,315,320],'max_features':['auto'] ,'random_state': [0]}]

grid_search_new = GridSearchCV(estimator = regressor_rf, param_grid = parameters_new, scoring = 'r2', cv = 10, n_jobs =-1)

grid_search_new.fit(x_train, y_train)

best_accuracy_new = grid_search_new.best_score_
best_parameters_new = grid_search_new.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy_new*100))
print("Best Parameters:", best_parameters_new)

  self.best_estimator_.fit(X, y, **fit_params)


Best Accuracy: 92.84 %
Best Parameters: {'max_features': 'auto', 'n_estimators': 300, 'random_state': 0}


Well, 300 estimators is what is best.

**Just for fun!** (trying a neural network for first time, excuse my lack of knowledge and technical depth )

In [24]:
from sklearn.neural_network import MLPRegressor
regressor_neural_network = MLPRegressor(random_state = 0)
regressor_neural_network.fit(x_train_sc, y_train_sc)

  y = column_or_1d(y, warn=True)


MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100,), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=200,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=0, shuffle=True, solver='adam',
             tol=0.0001, validation_fraction=0.1, verbose=False,
             warm_start=False)

In [0]:
y_predict_neural_network = y_sc.inverse_transform(regressor_neural_network.predict(x_sc.fit_transform(x_test)))

In [26]:
r2_neural_network = r2_score(y_test,y_predict_neural_network)
print(r2_neural_network)

0.6910931349017078
