## Predicting Crab Age from other Variables
<img src="../figures/mlj.svg" width=100 height=100>

We now want to predict crab age from the other variables, starting with a simple linear regression. For this we use the `MLJ` package, which plays a role similar to `scikit-learn` and gives access to many different models. First, we load `MLJ` together with the other packages we need.

In [1]:
using MLJ
using DataFrames
using CSV
using Plots
using Statistics

We'll load in the crab dataset from the previous part.

In [2]:
location = "../datasets/CrabAge.csv"
dataset = DataFrame(CSV.File(location))
dataset.Age = Float64.(dataset.Age) # convert to float for later
# show the first five lines of the dataset
first(dataset, 5)

| Row | Sex | Length | Diameter | Height | Weight | ShuckedWeight | VisceraWeight | ShellWeight | Age |
|---|---|---|---|---|---|---|---|---|---|
|  | String1 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | F | 1.4375 | 1.175 | 0.4125 | 24.6357 | 12.332 | 5.58485 | 6.74718 | 9.0 |
| 2 | M | 0.8875 | 0.65 | 0.2125 | 5.40058 | 2.29631 | 1.37495 | 1.55922 | 6.0 |
| 3 | I | 1.0375 | 0.775 | 0.25 | 7.95203 | 3.23184 | 1.60175 | 2.76408 | 6.0 |
| 4 | F | 1.175 | 0.8875 | 0.25 | 13.4802 | 4.74854 | 2.28213 | 5.24466 | 10.0 |
| 5 | I | 0.8875 | 0.6625 | 0.2125 | 6.9031 | 3.45864 | 1.48835 | 1.70097 | 6.0 |


We need to split our data into training and test sets. This is easily done with the `partition` function: here 80% of the rows go to the training set and the remaining 20% to the test set.

In [3]:
train_rows, test_rows = partition(1:nrows(dataset), 0.8, rng=129);
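
To make the idea behind such a split concrete, here is a minimal sketch using only the standard library. The helper `split_rows` is hypothetical and not part of `MLJ`; it illustrates a shuffled 80/20 split and is not meant to reproduce the exact indices that `partition` returns.

```julia
using Random

# Hypothetical helper (not part of MLJ): shuffle the row indices,
# then cut the shuffled list at the requested fraction.
function split_rows(n::Integer, fraction::Real; seed::Integer = 129)
    rng = MersenneTwister(seed)          # seeded generator makes the split reproducible
    rows = shuffle(rng, 1:n)             # random permutation of 1:n
    n_train = round(Int, fraction * n)   # number of rows in the first part
    return rows[1:n_train], rows[n_train+1:end]
end

train_idx, test_idx = split_rows(10, 0.8)
println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

Every row ends up in exactly one of the two sets, which is the property we rely on when we later evaluate on rows the model has not seen.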

We then load the `LinearRegressor` for performing linear regression. In this case, we opt for the `MLJLinearModels` flavour.

In [4]:
LinearRegressor = @load LinearRegressor pkg=MLJLinearModels

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /Users/max/.julia/packages/MLJModels/7apZ3/src/loading.jl:159


import MLJLinearModels

 ✔


MLJLinearModels.LinearRegressor

We then create a model instance and bind it to the training data in a machine. Calling `fit!()` on the machine then performs the regression.

In [5]:
model = LinearRegressor() # default hyperparameters
# columns 2:8 are the numeric features (Length to ShellWeight); Age is the target
mach = machine(model, dataset[train_rows, 2:8], dataset[train_rows, :Age])
fit!(mach, verbosity=0);
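
For intuition about what fitting a linear regression involves, the sketch below solves an ordinary least-squares problem directly with Julia's backslash operator. It uses synthetic data and made-up coefficients, and it assumes an unregularised model with an intercept; it is not a description of how `MLJLinearModels` is implemented internally.

```julia
using LinearAlgebra
using Random

rng = MersenneTwister(1)
n = 200
X = randn(rng, n, 2)                     # two synthetic features
# Synthetic target: intercept 3, slopes 2 and -1, plus a little noise
y = 3.0 .+ 2.0 .* X[:, 1] .- 1.0 .* X[:, 2] .+ 0.01 .* randn(rng, n)

A = hcat(X, ones(n))                     # append a column of ones for the intercept
coefs = A \ y                            # least-squares solution of A * coefs ≈ y

println("slopes: ", coefs[1:2], "  intercept: ", coefs[3])
```

The recovered slopes and intercept should be close to the values used to generate `y`, since the noise is small.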

We can also train and evaluate on the respective train and test sets in one call.

In [6]:
evaluate(model, dataset[:, 2:8], dataset[!, :Age], resampling = [(train_rows, test_rows)], measure=l2, verbosity=0)

PerformanceEvaluation object with these fields:
  model, measure, operation, measurement, per_fold,
  per_observation, fitted_params_per_fold,
  report_per_fold, train_test_rows, resampling, repeats
Extract:
┌──────────┬───────────┬─────────────┬──────────┐
│ measure  │ operation │ measurement │ per_fold │
├──────────┼───────────┼─────────────┼──────────┤
│ LPLoss(  │ predict   │ 5.71        │ [5.71]   │
│   p = 2) │           │             │          │
└──────────┴───────────┴─────────────┴──────────┘
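
The `l2` measure corresponds to the squared error, which as far as I understand is averaged over the test rows, so the reported 5.71 is in squared units of `Age`. Its square root, roughly 2.39, is easier to interpret because it is in the same units as `Age`. The sketch below computes both quantities by hand; the predictions are made up purely for illustration.

```julia
using Statistics

# Made-up targets and predictions, purely for illustration
y_true = [9.0, 6.0, 6.0, 10.0, 6.0]
y_pred = [8.0, 7.0, 6.0, 12.0, 5.0]

mse = mean((y_pred .- y_true) .^ 2)      # mean of the squared errors
rmse = sqrt(mse)                         # back in the units of the target
println("MSE = ", mse, ", RMSE = ", rmse)
```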


The `MLJ` package offers far more functionality than we cover here; see its [documentation](https://alan-turing-institute.github.io/MLJ.jl/dev/) for details.

---
Exercise 3
---

1. Here we trained a linear regression model using all numeric features. However, the correlation plot in the previous part suggested that some features may not be as informative. Can you find a model with fewer features that performs (nearly) as well?
2. Consult the MLJ documentation and try to train another type of regressor on the data (from either MLJLinearModels or MLJScikitLearnInterface).

In [None]:
# Expecting great things

---