# Airfoil Self Noise

The NASA dataset comprises NACA 0012 airfoils of different sizes tested at various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments.

Polynomial Regression has given the best fit so far. With so few features, a decision tree would not be worthwhile: it likely would not have enough features to be accurate.

## Index

1. [Data Preprocessing](#Data-Preprocessing)
2. [Train Model](#Training-Polynomial-model)
3. [Prediction, Equation, and Evaluation](#Prediction,-final-equation,-and-evaluation)
4. [Conclusion](#Conclusion)
5. [Citation](#References)

## Data Preprocessing

### Import libraries

NumPy, Matplotlib, and pandas are the big three for Python.

In [19]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Import data

As long as the target column (the value we're solving for) is the last one, this is a breeze.

In [20]:
dataset = pd.read_csv("/home/joe/Documents/ML-Resources/airfoil_self_noise.csv")
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
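
A minimal sketch of the same slicing on a tiny made-up DataFrame may make the pattern clearer. The column names and values here are illustrative assumptions, not the real data; the point is only that `iloc[:, :-1]` keeps every column except the last and `iloc[:, -1]` keeps only the last.

```python
import pandas as pd

# Hypothetical toy frame; the last column plays the role of the target.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "target": [100.0, 200.0, 300.0],
})

toy_x = toy.iloc[:, :-1].values  # all rows, every column except the last
toy_y = toy.iloc[:, -1].values   # all rows, last column only

print(toy_x.shape)  # (3, 2)
print(toy_y.shape)  # (3,)
```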

### Taking care of missing values

The original dataset had roughly 9018 values. Once a dataset gets that large, it makes sense to just assume it has bad values. Spot checks are not reliable here: `sample` is random, and `head` and `tail` only show a few entries at the beginning or end of the file. Worst case, this is a step you don't need.

In [21]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
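
For readers who would rather confirm whether anything is missing, a whole-column count is cheap. The sketch below uses a made-up two-column frame (an assumption for illustration, not the airfoil data) to show `isna().sum()` and what mean imputation does to a single missing entry.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing entry in column "b".
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 8.0]})

# Count missing values per column across the whole frame.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Mean imputation replaces each NaN with the mean of the observed
# values in the same column (here, the mean of 4.0 and 8.0).
toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = toy_imputer.fit_transform(toy.values)
print(filled[1, 1])  # 6.0
```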

### Split datasets

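The dataset needs to be split so the model can be evaluated on data it has not seen, which is how overtraining gets caught. Typically a 70/30 or 80/20 split is recommended. I just stick with 80/20 nowadays. If the test split is too small, the evaluation score will be noisy and unreliable, and if it is too large, too little data is left for training and the model will fit worse.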

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
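
As a rough illustration of what an 80/20 split produces, the sketch below splits ten made-up rows (an assumption for illustration only) and checks that each feature row stays paired with its target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i is [2*i, 2*i + 1] and its target is i.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=1
)

print(tx_train.shape, tx_test.shape)  # (8, 2) (2, 2)

# Rows are shuffled, but features and targets are shuffled together.
print(np.array_equal(tx_train[:, 0], 2 * ty_train))  # True
```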

## Training Polynomial model

I tried a few models here. I typically prefer to start with MLR (Multilinear Regression) to get a feel for the coefficients and r-score.

Note that the degree here is set to 3, the same as the number of independent (feature) columns. That's sort of coincidental and at the same time sort of determined by the data: there are few enough feature columns that I can make the degree of the equation match them. Typically a degree of 4 or 5 is the highest you want to go, but I've noticed sometimes the best fitting models are ones where the degree of the equation is equal to the number of feature columns.

In [39]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 3)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly, Y_train)
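
To get a feel for how quickly the degree inflates the model, the sketch below counts the terms that `PolynomialFeatures` generates. It assumes three feature columns, which is consistent with the 20 coefficients printed later in this notebook; the random input values are placeholders.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 3  # assumed number of feature columns
degree = 3

# Placeholder input; only the number of columns matters here.
toy_x = np.random.default_rng(0).random((5, n_features))
expanded = PolynomialFeatures(degree=degree).fit_transform(toy_x)

# Every monomial of total degree <= 3, including the constant term.
print(expanded.shape[1])                  # 20
print(comb(n_features + degree, degree))  # 20
```

Raising the degree to 5 with the same three columns would give `comb(8, 5)`, or 56 terms, which is part of why high degrees overfit so easily.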

## Prediction, final equation, and evaluation

### Prediction output

The left-hand values are the predicted values. The right-hand values are the actual values from the test set.

In [40]:
y_pred = regressor.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), Y_test.reshape(len(Y_test),1)),1))

[[119.64 117.74]
 [115.76 118.12]
 [125.24 120.66]
 [129.1  122.23]
 [130.17 129.34]
 [129.58 126.59]
 [131.55 133.44]
 [129.92 131.58]
 [125.94 111.91]
 [127.56 129.97]
 [123.31 118.62]
 [125.1  126.34]
 [121.2  123.92]
 [126.35 129.  ]
 [104.3  108.69]
 [123.64 125.4 ]
 [116.5  117.78]
 [124.01 123.25]
 [130.62 132.3 ]
 [124.66 125.72]
 [129.21 135.54]
 [120.85 119.56]
 [118.5  110.45]
 [128.64 123.74]
 [125.75 127.63]
 [127.06 124.76]
 [130.83 131.72]
 [125.93 123.69]
 [126.14 129.98]
 [129.6  128.52]
 [124.23 126.54]
 [126.61 125.8 ]
 [129.33 128.25]
 [128.18 130.96]
 [129.16 126.67]
 [130.89 131.24]
 [124.98 126.54]
 [126.25 125.5 ]
 [127.54 129.09]
 [131.58 133.38]
 [124.65 124.53]
 [128.08 128.71]
 [127.47 128.81]
 [128.36 123.76]
 [124.31 130.  ]
 [123.46 121.66]
 [124.22 124.45]
 [123.48 128.2 ]
 [123.9  120.04]
 [120.45 124.3 ]
 [123.26 121.77]
 [130.46 133.04]
 [128.15 131.45]
 [130.75 119.51]
 [132.33 135.87]
 [125.94 114.04]
 [127.85 129.38]
 [126.21 121.55]
 [108.7  111.5

### Final equation

This'll give the coefficients and intercept values. This is far more useful with multiple linear regression than here, but I left it anyway.

In [41]:
print(regressor.coef_)
print(regressor.intercept_)

[ 0.00e+00  3.30e-03  1.17e+02 -1.94e+01 -2.96e-07 -3.13e-02 -1.99e-01
 -5.84e+02 -4.70e+02 -4.48e+01  6.91e-12  7.91e-07  9.32e-06  5.14e-02
 -9.23e-02  1.74e+00  8.46e+02 -1.32e+02 -2.58e+01 -6.03e+00]
123.96018648733657


The values below were found with Multilinear Regression, and its r-score was very close to the Polynomial Model's r-score:

Scaled sound pressure level = -0.00128 x frequency (hertz) - 0.426 x angle of attack (degrees) - 36.2 x chord length (meters) + 0.1 x free-stream velocity - 151 x suction side displacement thickness (meters) + 133.06056810486479

Free-stream velocity and angle of attack contribute very little to the model's accuracy. Removing them both only knocked the r-score down a few points.
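
Writing a linear equation out by hand makes it easy to pair a coefficient with the wrong feature. One way to avoid that is to build the string from `coef_` and `intercept_` directly. This is a sketch on synthetic data, and the feature names and the true relationship (`2a - 3b + 5`) are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (illustrative only).
rng = np.random.default_rng(0)
toy_x = rng.random((50, 2))
toy_y = 2.0 * toy_x[:, 0] - 3.0 * toy_x[:, 1] + 5.0

mlr = LinearRegression().fit(toy_x, toy_y)

# Hypothetical feature names, in the same order as the columns of toy_x.
names = ["feature_a", "feature_b"]

# coef_[i] belongs to column i, so zip keeps each pair aligned.
terms = [f"{coef:+.3g} x {name}" for coef, name in zip(mlr.coef_, names)]
equation = "target = " + " ".join(terms) + f" {mlr.intercept_:+.3g}"
print(equation)
```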

### Evaluate the model

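The closer the r-score (the R² value returned by `r2_score`) is to 1, the better. Negative values are rare and mean the model predicts worse than simply guessing the mean of the test values, so something is horribly wrong. Typically, the r-score should be between 0 and 1. If you get around 0.7 in the real world, you have a strong correlation between predicted and actual values.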

In [42]:
from sklearn.metrics import r2_score
r2_score(Y_test, y_pred)

0.6059162515530709
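
For reference, `r2_score` computes one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below reproduces that by hand on a few made-up numbers (not values from this notebook) and shows how a negative score can arise.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)                # 0.9625
print(r2_score(y_true, y_hat))  # same value

# Predictions that are worse than always guessing the mean give a negative score.
y_bad = y_true[::-1].copy()
print(r2_score(y_true, y_bad))  # -3.0
```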

## Conclusion

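With an r-score of about 0.61 on the test set, it seems the airfoil's physical attributes, rather than free-stream velocity or angle of attack, are most closely correlated to the sound pressure created by an airfoil.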

<h2 align=center>References</h2>

Dataset: https://archive.ics.uci.edu/ml/datasets/airfoil+self-noise#

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Donor: Dr Roberto Lopez, Intelnics (robertolopez '@' intelnics.com)

Creators: Thomas F. Brooks, D. Stuart Pope, and Michael A. Marcolini, NASA