Before running this, please make sure to activate and instantiate the environment
corresponding to [this `Project.toml`](https://raw.githubusercontent.com/alan-turing-institute/MLJTutorials/master/Project.toml) and [this `Manifest.toml`](https://raw.githubusercontent.com/alan-turing-institute/MLJTutorials/master/Manifest.toml)
so that you get an environment which matches the one used to generate the tutorials:

```julia
cd("MLJTutorials") # cd to folder with the *.toml
using Pkg; Pkg.activate("."); Pkg.instantiate()
```

## Stock market data

Let's load the usual packages and the data:

In [None]:
using MLJ, RDatasets, ScientificTypes,
      DataFrames, Statistics, StatsBase

smarket = dataset("ISLR", "Smarket")
@show size(smarket)
@show names(smarket)

Let's get a description too:

In [None]:
describe(smarket, :mean, :std, :eltype)

The target variable is `:Direction`:

In [None]:
y = smarket.Direction
X = select(smarket, Not(:Direction));

We can compute all the pairwise correlations; we use `Matrix` so that the dataframe entries are considered as one matrix of numbers (otherwise `cor` won't work):

In [None]:
cm = X |> Matrix |> cor
round.(cm, sigdigits=1)
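
To see what `cor` returns when it is given a matrix, here is a minimal sketch on a small made-up matrix (the values in `A` are illustrative and are not taken from `Smarket`). Each entry `C[i, j]` is the Pearson correlation between columns `i` and `j`.

```julia
using Statistics

# Toy data (hypothetical): 4 observations (rows) of 3 features (columns);
# the second column is exactly twice the first.
A = [1.0 2.0 1.0;
     2.0 4.0 0.0;
     3.0 6.0 2.0;
     4.0 8.0 1.0]

# For a matrix, `cor` treats each column as a variable and returns
# the matrix of pairwise Pearson correlations.
C = cor(A)

size(C)    # (3, 3): one row and one column per feature
C[1, 2]    # ≈ 1, since column 2 is a positive multiple of column 1
C ≈ C'     # true: the correlation matrix is symmetric
```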

Let's see what the `:Volume` feature looks like:

In [None]:
using PyPlot
figure(figsize=(8,6))
plot(X.Volume)
xlabel("Tick number", fontsize=14)
ylabel("Volume", fontsize=14)
xticks(fontsize=12)
yticks(fontsize=12)

savefig("assets/ISL-volume.svg") # hide

![volume](/assets/ISL-volume.svg)

### Logistic Regression

We will now train some models. The target `:Direction` has two classes, `Up` and `Down`, and must first be coerced to the `Multiclass` scientific type:

In [None]:
yc = coerce(y, Multiclass)
unique(yc)

Let's now try fitting a simple logistic classifier (aka logistic regression), leaving out the `:Year` and `:Today` features:

In [None]:
@load LogisticClassifier pkg=MLJLinearModels
X2 = select(X, Not([:Year, :Today]))
clf = machine(LogisticClassifier(), X2, yc)

Let's fit it to the data and compute the mean cross-entropy loss on the training set:

In [None]:
fit!(clf)
ŷ = predict(clf, X2)
cross_entropy(ŷ, yc) |> mean
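
As a reminder of what this loss measures, the following sketch computes a cross-entropy by hand on made-up numbers: for each observation it is the negative log of the probability that the model assigned to the class that actually occurred. The vectors `p_up` and `truth` are hypothetical, and MLJ's `cross_entropy` may differ in details such as how it clamps probabilities.

```julia
using Statistics

# Hypothetical predicted probabilities of "Up" for 4 observations
p_up  = [0.9, 0.6, 0.5, 0.2]
truth = ["Up", "Down", "Up", "Down"]

# Probability assigned to the class that was actually observed
p_true = [t == "Up" ? p : 1 - p for (p, t) in zip(p_up, truth)]

# Per-observation loss, then its mean
losses = -log.(p_true)
mean_loss = mean(losses)
```

A confident and correct prediction gives a loss close to 0, while a prediction of 0.5 gives `log(2)`, roughly 0.69.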

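Note that here the entries of `ŷ` are probabilistic predictions (one distribution over the classes per row) rather than class labels; to recover a class, we can take the mode of each distribution and then compute the misclassification rate: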

In [None]:
ŷ = predict_mode(clf, X2)
misclassification_rate(ŷ, yc)
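
The two steps (taking the mode, then counting mistakes) can be sketched by hand on made-up data. The vectors below are hypothetical, and the tie-breaking rule at a probability of exactly 0.5 is an arbitrary choice for this illustration, not necessarily what `predict_mode` does.

```julia
using Statistics

# Hypothetical predicted probabilities of "Up" and the observed classes
p_up  = [0.9, 0.6, 0.5, 0.2]
truth = ["Up", "Down", "Up", "Down"]

# For two classes, the mode is the class with probability above 0.5
# (a tie at exactly 0.5 is assigned to "Down" here, arbitrarily)
pred = [p > 0.5 ? "Up" : "Down" for p in p_up]

# Fraction of observations where the predicted class differs from the truth
mcr = mean(pred .!= truth)
```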

Well that's not fantastic...

Let's visualise how we're doing by building a confusion matrix manually.
In each variable name, the first label is the prediction and the second is the truth:

In [None]:
TN = down_down = sum(ŷ .== y .== "Down")
FN = down_up = sum(ŷ .!= y .== "Up")
FP = up_down = sum(ŷ .!= y .== "Down")
TP = up_up = sum(ŷ .== y .== "Up")

conf_mat = [down_down down_up; up_down up_up]

From these counts we can easily compute, for instance, the accuracy, precision and recall:

In [None]:
acc = (TN + TP) / length(y)
prec = TP / (TP + FP)
rec  = TP / (TP + FN)
@show round(acc, sigdigits=3)
@show round(prec, sigdigits=3)
@show round(rec, sigdigits=3)
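
The counting above relies on chained elementwise comparisons. The sketch below repeats it on a small hypothetical pair of vectors (`pred` and `truth` are made up) so that each count can be checked by eye.

```julia
# Hypothetical predictions and ground truth for 6 observations
pred  = ["Up", "Up",   "Down", "Down", "Up", "Down"]
truth = ["Up", "Down", "Down", "Up",   "Up", "Down"]

# `a .== b .== c` is a chained elementwise comparison,
# equivalent to `(a .== b) .& (b .== c)`
TP = sum(pred .== truth .== "Up")     # predicted "Up",   truth "Up"
TN = sum(pred .== truth .== "Down")   # predicted "Down", truth "Down"
FP = sum(pred .!= truth .== "Down")   # predicted "Up",   truth "Down"
FN = sum(pred .!= truth .== "Up")     # predicted "Down", truth "Up"

acc  = (TP + TN) / length(truth)   # share of correct predictions
prec = TP / (TP + FP)              # share of predicted "Up" that are truly "Up"
rec  = TP / (TP + FN)              # share of true "Up" that are predicted "Up"
```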

Let's now train on the data before 2005 and use it to predict on the rest.
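
A minimal sketch of the split itself, assuming a hypothetical `years` vector standing in for the `:Year` column:

```julia
# Hypothetical `Year` column, sorted in time
years = [2001, 2002, 2003, 2004, 2005, 2005]

# Row indices for training (before 2005) and for testing (2005 onwards)
train = findall(years .< 2005)
test  = findall(years .>= 2005)
```

With the objects defined earlier, such index vectors should be usable through MLJ's `rows` keyword, for example `fit!(clf, rows=train)` followed by `predict_mode(clf, rows=test)`.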

### LDA

### QDA

_QDA is not yet supported_

### KNN

## Caravan insurance data

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*