## Overview

This example shows how to load the Iris dataset and prepare it with existing Julia tools within the OAR project; the same steps can be adapted for other Julia projects.
Other scripts in this project use higher-level functions that load, transform, and split the data automatically, whereas this example walks through those steps at a low level.

## Setup

First, we load some dependencies:

In [1]:
# Multi-line using statements are permitted in Julia to gather all requirements and compile at once
using
    OAR,                # This project
    MLDatasets,         # Iris dataset
    MLDataUtils         # Data utilities, splitting, etc.

## Loading the Dataset

We load the Iris dataset because it is small and is a common benchmark for clustering algorithms.

In [2]:
iris = Iris()

dataset Iris:
  metadata   =>    Dict{String, Any} with 4 entries
  features   =>    150×4 DataFrame
  targets    =>    150×1 DataFrame
  dataframe  =>    150×5 DataFrame

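Next, we convert the feature and target DataFrames into a feature matrix and a label vector. The transpose (`'`) arranges the features so that each column is one sample, giving a 4×150 matrix: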

In [3]:
features, labels = Matrix(iris.features)', vec(Matrix{String}(iris.targets))

([5.1 4.9 … 6.2 5.9; 3.5 3.0 … 3.4 3.0; 1.4 1.4 … 5.4 5.1; 0.2 0.2 … 2.3 1.8], ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-setosa"  …  "Iris-virginica", "Iris-virginica", "Iris-virginica", "Iris-virginica", "Iris-virginica", "Iris-virginica", "Iris-virginica", "Iris-virginica", "Iris-virginica", "Iris-virginica"])
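The trailing `'` in the cell above is what moves the samples into columns. As a minimal sketch, using a small stand-in matrix with toy values (not the Iris data) and only the standard library, the adjoint is a lazy wrapper, and `permutedims` gives a dense copy with the same layout if one is needed:

```julia
using LinearAlgebra  # provides the Adjoint wrapper type

# Toy stand-in for a samples-by-features table: 3 samples, 2 features
rows = [5.1 3.5;
        4.9 3.0;
        6.2 3.4]

cols_lazy = rows'               # lazy wrapper: features × samples, no copy
cols_dense = permutedims(rows)  # dense copy with the same layout

size(cols_lazy)                 # (2, 3)
cols_lazy isa Adjoint           # true
cols_dense isa Matrix{Float64}  # true
cols_lazy[:, 1]                 # first sample as a feature vector: [5.1, 3.5]
```

Either form can be indexed column by column, so the choice mostly matters when downstream code dispatches on `Matrix` specifically.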

Because the MLDatasets package gives us the Iris labels as strings, we use the `MLDataUtils.convertlabel` method with the `MLLabelUtils.LabelEnc.Indices` encoding to get a vector of integer indices, one per class:

In [4]:
labels = convertlabel(LabelEnc.Indices{Int}, labels)
unique(labels)

3-element Vector{Int64}:
 1
 2
 3
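To make the index encoding concrete, here is a rough sketch of the same idea in plain Julia. The helper `encode_indices` is hypothetical and is not part of MLDataUtils; it assigns indices in order of first appearance, which is an assumption and may not match the ordering that `convertlabel` chooses in every case:

```julia
# Hypothetical helper, not part of MLDataUtils: map each distinct label
# to 1, 2, 3, ... in order of first appearance.
function encode_indices(labels::AbstractVector)
    classes = unique(labels)  # distinct labels, first-seen order
    lookup = Dict(c => i for (i, c) in enumerate(classes))
    return [lookup[l] for l in labels], classes
end

raw = ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-versicolor"]
encoded, classes = encode_indices(raw)
# encoded == [1, 1, 2, 3, 2]
# classes[encoded] recovers the original strings
```

Returning `classes` alongside the encoded vector keeps the mapping reversible, which is useful when reporting results with the original class names.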

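Next, we create a stratified train/test split with the `MLDataUtils.stratifiedobs` utility, which preserves the class proportions in each subset. With its default ratio, 70% of the observations go to the training set: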

In [5]:
(X_train, y_train), (X_test, y_test) = stratifiedobs((features, labels))

(([6.3 5.8 … 4.4 6.3; 2.8 2.7 … 3.2 3.3; 5.1 3.9 … 1.3 6.0; 1.5 1.2 … 0.2 2.5], [3, 2, 3, 2, 1, 1, 3, 1, 1, 3  …  1, 1, 3, 1, 2, 1, 1, 2, 1, 3]), ([6.3 4.4 … 7.2 5.9; 2.5 2.9 … 3.2 3.0; 4.9 1.4 … 6.0 5.1; 1.5 0.2 … 1.8 1.8], [2, 1, 3, 1, 2, 1, 3, 3, 1, 2  …  2, 3, 3, 1, 1, 2, 1, 3, 3, 3]))
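For readers who want to see what a stratified split involves, the following is a simplified sketch in plain Julia. The function `stratified_split` and its keyword arguments are hypothetical and are not the MLDataUtils implementation; the data is random stand-in data with the same shape as Iris (4 features, 3 classes of 50 samples):

```julia
using Random

# Hypothetical helper: split column-wise observations so that each class
# contributes roughly the fraction `p` of its samples to the training set.
function stratified_split(X::AbstractMatrix, y::AbstractVector;
                          p=0.7, rng=Random.default_rng())
    train_idx = Int[]
    test_idx = Int[]
    for class in unique(y)
        idx = shuffle(rng, findall(==(class), y))  # shuffled indices of this class
        n_train = round(Int, p * length(idx))
        append!(train_idx, idx[1:n_train])
        append!(test_idx, idx[n_train+1:end])
    end
    # Shuffle again so the classes are interleaved within each subset
    shuffle!(rng, train_idx)
    shuffle!(rng, test_idx)
    return (X[:, train_idx], y[train_idx]), (X[:, test_idx], y[test_idx])
end

# Stand-in data: 4 features × 150 samples, 3 classes of 50 samples each
X = rand(MersenneTwister(0), 4, 150)
y = repeat(1:3, inner=50)

(X_tr, y_tr), (X_te, y_te) = stratified_split(X, y; rng=MersenneTwister(1))
size(X_tr), size(X_te)  # ((4, 105), (4, 45))
```

Splitting per class and then concatenating is what keeps the class balance; a plain random split of all 150 indices would only preserve it on average.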

We now have a train/test split of the features and targets for the Iris dataset.
This project also defines some low-level data utilities for more easily passing around and transforming this data, so we often see this train/test split as a combined `DataSplit` struct:

In [6]:
data = OAR.DataSplit(X_train, X_test, y_train, y_test)

OAR.DataSplit: dim=4, n_train=105, n_test=45:
train_x: (4, 105) Matrix{Float64}
test_x: (4, 45) Matrix{Float64}
train_y: (105,) Vector{Int64}
test_y: (45,) Vector{Int64}


We can also turn this `DataSplit` into a vectored variant, where the features are stored as a vector of per-sample vectors rather than combined into a single matrix as in the `DataSplit`:

In [7]:
data_vec = OAR.VectoredDataSplit(data)

OAR.VectoredDataSplit{Float64, Int64}: dim=4, n_train=105, n_test=45:
train_x: (105,) Vector{Vector{Float64}}
test_x: (45,) Vector{Vector{Float64}}
train_y: (105,) Vector{Int64}
test_y: (45,) Vector{Int64}
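The difference between the two layouts can be illustrated with plain Julia. This is only a sketch on a toy matrix and is not necessarily how `OAR.VectoredDataSplit` performs the conversion internally:

```julia
# Toy feature matrix: 2 features × 3 samples, one sample per column
X = [5.1 4.9 6.2;
     3.5 3.0 3.4]

# Matrix -> vector of per-sample vectors
X_vec = [collect(col) for col in eachcol(X)]

# Vector of vectors -> matrix again
X_back = reduce(hcat, X_vec)

X_vec[1]     # [5.1, 3.5]
X_back == X  # true
```

The vectored layout is convenient when samples are processed one at a time, while the matrix layout suits batched linear algebra.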


---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*