Before running this, please make sure to activate and instantiate the
tutorial-specific package environment, using this
[`Project.toml`](https://raw.githubusercontent.com/juliaai/DataScienceTutorials.jl/gh-pages/__generated/EX-airfoil/Project.toml) and
[this `Manifest.toml`](https://raw.githubusercontent.com/juliaai/DataScienceTutorials.jl/gh-pages/__generated/EX-airfoil/Manifest.toml), or by following
[these](https://juliaai.github.io/DataScienceTutorials.jl/#learning_by_doing) detailed instructions.

**Main author**: [Ashrya Agrawal](https://github.com/ashryaagr).

@@dropdown
## Getting started
@@
@@dropdown-content
Here we use the [UCI "Airfoil Self-Noise" dataset](http://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise). The goal is to predict the scaled sound level (`Scaled_Sound`) from the other five measurements.

@@dropdown
### Loading and preparing the data
@@
@@dropdown-content

In [None]:
using MLJ
using PrettyPrinting
import DataFrames
import Statistics
using CSV
using HTTP
using StableRNGs


req = HTTP.get("https://raw.githubusercontent.com/rupakc/UCI-Data-Analysis/master/Airfoil%20Dataset/airfoil_self_noise.dat");

# The file has no header row, so the column names are supplied explicitly.
df = CSV.read(req.body, DataFrames.DataFrame; header=[
                   "Frequency","Attack_Angle","Chord_Length",
                   "Free_Velocity","Suction_Side","Scaled_Sound"
                   ]
              );
df[1:5, :] |> pretty

Inspect the schema:

In [None]:
schema(df)

Unpack the table into the target `y` (`Scaled_Sound`) and the feature table `X`:

In [None]:
y, X = unpack(df, ==(:Scaled_Sound));

Now we standardize the features using the `Standardizer()` transformer:

In [None]:
X = MLJ.transform(fit!(machine(Standardizer(), X)), X);
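
To make the standardization step concrete, here is a minimal sketch of the same idea in plain Julia, on a made-up column rather than the airfoil data. It assumes the usual definition (subtract the mean, divide by the sample standard deviation); the names `x` and `z` are illustrative only. To my understanding, `Standardizer` rescales only `Continuous` columns by default, so an integer-typed column would be left unchanged unless it is coerced first.

```julia
using Statistics

# Hypothetical toy column, not taken from the airfoil data.
x = [2.0, 4.0, 6.0, 8.0]

# Standardize: subtract the mean, then divide by the sample standard deviation.
z = (x .- mean(x)) ./ std(x)

println(z)
println("mean = ", mean(z), ", std = ", std(z))
```

After this transformation the column has mean 0 and standard deviation 1 (up to floating-point rounding), which puts features measured in different units on a comparable scale.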

Partition the row indices into a training set (70%) and a test set (30%):

In [None]:
train, test = partition(collect(eachindex(y)), 0.7, shuffle=true, rng=StableRNG(612));
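
The shuffled split above can be pictured with a small hand-written sketch. This is an illustration of the idea only, not how `partition` is implemented: it uses the standard library's `MersenneTwister` in place of `StableRNG`, a made-up row count, and hypothetical names (`idx`, `train_idx`, `test_idx`), so the resulting indices will differ from the tutorial's.

```julia
using Random

n = 10                          # hypothetical number of rows
rng = MersenneTwister(612)      # assumption: a stdlib RNG stands in for StableRNG
idx = shuffle(rng, collect(1:n))

n_train = round(Int, 0.7 * n)   # 70% of the rows go to the training set
train_idx = idx[1:n_train]
test_idx = idx[n_train+1:end]

println("train: ", train_idx)
println("test:  ", test_idx)
```

Fixing the seed makes the split reproducible, and every row lands in exactly one of the two sets.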

Let's first see which models are compatible with the scientific types of `X` and `y`:

In [None]:
for model in models(matching(X, y))
    println("Model Name: ", model.name, " , Package: ", model.package_name)
end

Note that if we coerce `X.Frequency` to `Continuous`, many more models are available:

In [None]:
coerce!(X, :Frequency=>Continuous)

for model in models(matching(X, y))
    println("Model Name: ", model.name, " , Package: ", model.package_name)
end

‎
@@

‎
@@
@@dropdown
## DecisionTreeRegressor
@@
@@dropdown-content

We will first try out DecisionTreeRegressor:

In [None]:
DecisionTreeRegressor = @load DecisionTreeRegressor pkg=DecisionTree

dcrm = machine(DecisionTreeRegressor(), X, y)

fit!(dcrm, rows=train)
pred_dcrm = predict(dcrm, rows=test);

Now you can call a loss function to assess the performance on the test set. Here we use the root-mean-square error, `rms`:

In [None]:
rms(pred_dcrm, y[test])
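
For reference, the root-mean-square error is the square root of the mean squared difference between predictions and observed values. The sketch below writes that formula out in plain Julia on made-up numbers; `rms_by_hand`, `y_hat`, and `y_true` are hypothetical names used only for illustration.

```julia
using Statistics

# Root-mean-square error written out by hand (hypothetical helper name).
rms_by_hand(yhat, ytrue) = sqrt(mean(abs2, yhat .- ytrue))

y_hat  = [1.0, 2.0, 3.0]   # made-up predictions
y_true = [1.0, 2.0, 5.0]   # made-up observed values

println(rms_by_hand(y_hat, y_true))
```

Lower is better, and the value is expressed in the same units as the target.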

‎
@@
@@dropdown
## RandomForestRegressor
@@
@@dropdown-content

Now let's try out RandomForestRegressor:

In [None]:
RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
rfr = RandomForestRegressor()

rfr_m = machine(rfr, X, y);

Train on the rows indexed by `train`:

In [None]:
fit!(rfr_m, rows=train);

Predict on the rows indexed by `test` and compute the error:

In [None]:
pred_rfr = predict(rfr_m, rows=test);
rms(pred_rfr, y[test])

Unsurprisingly, the RandomForestRegressor does a better job.

Can we do even better? Yes, by tuning the model's hyperparameters.

‎
@@
@@dropdown
## Tuning
@@
@@dropdown-content

If you are new to model tuning in MLJ, refer to [lab5](https://alan-turing-institute.github.io/DataScienceTutorials.jl/isl/lab-5/) and [model-tuning](https://alan-turing-institute.github.io/DataScienceTutorials.jl/getting-started/model-tuning/).

To tune hyperparameters, we first specify a range of values for each parameter of interest:

In [None]:
r_ntrees = range(rfr, :n_trees, lower=9, upper=15)   # number of trees in the forest
r_samF = range(rfr, :sampling_fraction, lower=0.6, upper=0.8)
r = [r_ntrees, r_samF];

Now we specify how the tuning should be done. Let's just specify a coarse grid tuning with cross validation and instantiate a tuned model:

In [None]:
tuning = Grid(resolution=7)
resampling = CV(nfolds=6)

tm = TunedModel(model=rfr, tuning=tuning,
                resampling=resampling, ranges=r, measure=rms)

rfr_tm = machine(tm, X, y);
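
To get a feel for the size of this search, the sketch below enumerates a comparable grid in plain Julia. It assumes that `Grid(resolution=7)` places 7 evenly spaced points in each range, which for the integer range 9 to 15 means every integer; the names `n_trees_values`, `fraction_values`, and `grid` are hypothetical.

```julia
# Hypothetical re-creation of a 7 x 7 grid using only Base Julia.
n_trees_values = 9:15                           # 7 integer values
fraction_values = range(0.6, 0.8; length=7)     # 7 evenly spaced values

# Every (n_trees, sampling_fraction) pair is one candidate model.
grid = vec(collect(Iterators.product(n_trees_values, fraction_values)))

println("candidates: ", length(grid))
println("fits with 6 folds: ", 6 * length(grid))
println(grid[1:3])
```

Under these assumptions there are 49 candidates, and with 6-fold cross-validation each candidate is trained 6 times, so the tuning step costs roughly 294 forest fits before the final refit of the best model.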

Train on the rows indexed by `train`:

In [None]:
fit!(rfr_tm, rows=train);
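
During this fit, each candidate is scored by cross-validation on the training rows only. The following sketch shows the idea of a 6-fold split on a made-up set of 12 row indices, using contiguous blocks (which, to my understanding, matches the default of `CV` when no shuffling is requested). The names `rows`, `folds`, and `fold_size` are hypothetical, and the sketch assumes the row count divides evenly by the number of folds.

```julia
# Hypothetical k-fold split of 12 row indices into 6 folds, Base Julia only.
rows = collect(1:12)
nfolds = 6
fold_size = div(length(rows), nfolds)   # assumes the row count divides evenly

# Each fold is a contiguous block of held-out rows.
folds = [rows[(k-1)*fold_size+1 : k*fold_size] for k in 1:nfolds]

for (k, validation) in enumerate(folds)
    training = setdiff(rows, validation)   # all rows outside the held-out fold
    println("fold ", k, ": validate on ", validation,
            ", train on ", length(training), " rows")
end
```

Each row is held out exactly once, and the reported measure for a candidate is aggregated over the folds.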

Predict on the rows indexed by `test` and compute the error:

In [None]:
pred_rfr_tm = predict(rfr_tm, rows=test);
rms(pred_rfr_tm, y[test])

That was great! Tuning has further reduced the RMS error on the test set.

To retrieve the best model, you can use:

In [None]:
fitted_params(rfr_tm).best_model

Let's visualize the tuning results:

In [None]:
using Plots
plot(rfr_tm)

\figalt{Hyperparameter heatmap}{airfoil_heatmap.svg}

‎
@@

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*