## ScientificTypes

A lightweight Julia interface for implementing conventions about the
scientific interpretation of data, and for performing type coercions
that enforce those conventions.

ScientificTypes provides:

- A hierarchy of new Julia types representing scientific data types
  for use in method dispatch (e.g., for trait values). Instances of
  the types play no role:

In [1]:
using ScientificTypes, AbstractTrees
ScientificTypes.tree()

Found
├─ Known
│  ├─ Finite
│  │  ├─ Multiclass
│  │  └─ OrderedFactor
│  ├─ Image
│  │  ├─ ColorImage
│  │  └─ GrayImage
│  ├─ Infinite
│  │  ├─ Continuous
│  │  └─ Count
│  └─ Table
└─ Unknown


- A single method `scitype` for articulating a convention about what
  scientific type each Julia object can represent. For example, one
  might declare `scitype(::AbstractFloat) = Continuous`.

- A default convention called *mlj*, based on optional dependencies
  CategoricalArrays, ColorTypes, and Tables, which includes a convenience
  method `coerce` for performing scientific type coercion on
  AbstractVectors and columns of tabular data (any table
  implementing the
  [Tables.jl](https://github.com/JuliaData/Tables.jl) interface). A
  table at the end of this document details the convention.

- A `schema` method for tabular data, based on the optional Tables
  dependency, for inspecting the machine and scientific types of
  tabular data, in addition to column names and number of rows.

The only core dependencies of ScientificTypes are Requires and
InteractiveUtils (from the standard library).
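
To illustrate how a convention can be expressed through dispatch, here is a minimal Base-only sketch. The `Toy*` types and the function `toy_scitype` are hypothetical stand-ins invented for this example; they are not the definitions used by ScientificTypes.

```julia
# Hypothetical toy hierarchy, mirroring part of the tree shown above.
# All types are abstract: only the types matter, never their instances.
abstract type ToyFound end
abstract type ToyKnown <: ToyFound end
abstract type ToyUnknown <: ToyFound end
abstract type ToyInfinite <: ToyKnown end
abstract type ToyContinuous <: ToyInfinite end
abstract type ToyCount <: ToyInfinite end

# A "convention" is just a set of methods mapping objects to scientific types.
toy_scitype(::Any) = ToyUnknown              # fallback for anything unrecognised
toy_scitype(::Missing) = Missing             # Missing is treated as a scientific type
toy_scitype(::AbstractFloat) = ToyContinuous
toy_scitype(::Integer) = ToyCount

toy_scitype(3.14)    # ToyContinuous
toy_scitype(42)      # ToyCount
toy_scitype("abc")   # ToyUnknown
```

Because the return values are types, they can be used directly in subtype checks such as `toy_scitype(3.14) <: ToyInfinite`.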

### Quick start

Install with `using Pkg; Pkg.add("ScientificTypes")`.

Get the scientific type of some Julia object, using the default
convention:

In [2]:
scitype(3.14)

ScientificTypes.Continuous

#### Typical type coercion work-flow for tabular data

In [3]:
using CategoricalArrays, DataFrames, Tables
X = DataFrame(name=["Siri", "Robo", "Alexa", "Cortana"],
              height=[152, missing, 148, 163],
              rating=[1, 5, 2, 1])

|   | name    | height  | rating |
|:--|:--------|:--------|:-------|
|   | String  | Int64⍰  | Int64  |
| 1 | Siri    | 152     | 1      |
| 2 | Robo    | missing | 5      |
| 3 | Alexa   | 148     | 2      |
| 4 | Cortana | 163     | 1      |


In [4]:
schema(X)

(names = (:name, :height, :rating), types = (String, Union{Missing, Int64}, Int64), scitypes = (ScientificTypes.Unknown, Union{Missing, Count}, ScientificTypes.Count), nrows = 4)

In [5]:
schema(X).scitypes

(ScientificTypes.Unknown, Union{Missing, Count}, ScientificTypes.Count)

In [6]:
fix = Dict(:name=>Multiclass,
           :height=>Continuous,
           :rating=>OrderedFactor);
Xfixed = coerce(fix, X)

│ Coerced to Union{Missing,ScientificTypes.Continuous} instead. 
└ @ ScientificTypes /Users/anthony/Dropbox/Julia7/MLJ/ScientificTypes/src/conventions/mlj/mlj.jl:5


|   | name         | height   | rating       |
|:--|:-------------|:---------|:-------------|
|   | Categorical… | Float64⍰ | Categorical… |
| 1 | Siri         | 152.0    | 1            |
| 2 | Robo         | missing  | 5            |
| 3 | Alexa        | 148.0    | 2            |
| 4 | Cortana      | 163.0    | 1            |


In [7]:
schema(Xfixed).scitypes

(ScientificTypes.Multiclass{4}, Union{Missing, Continuous}, ScientificTypes.OrderedFactor{3})

Test whether every column of a table has an element scientific type
that is a subtype of one of the types in a specified list:

In [8]:
scitype(Xfixed) <: Table(Union{Missing,Continuous}, Finite)

true
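
In the work-flow above, coercing `:height` to `Continuous` turned an integer column containing a missing value into a floating point column. The following Base-only sketch reproduces just that one conversion. The helper name `to_continuous` is hypothetical, and the package's `coerce` does considerably more (for example, the categorical conversions), so treat this as an illustration of the idea only.

```julia
# Hypothetical helper: convert numbers to floats, passing `missing` through.
to_continuous(v::AbstractVector) = [ismissing(x) ? missing : float(x) for x in v]

height = [152, missing, 148, 163]   # Vector{Union{Missing, Int64}}
h = to_continuous(height)

eltype(h)   # Union{Missing, Float64}
```

The missing entry is why the element type stays a `Union` with `Missing`, which matches the warning printed by `coerce` above.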

### Notes

- We regard the built-in Julia type `Missing` as a scientific
  type. The new scientific types introduced in the current package
  are rooted in the abstract type `Found` (see tree above) and we
  export the alias `Scientific = Union{Missing, Found}`.

- `Finite{N}`, `Multiclass{N}` and `OrderedFactor{N}` are all
  parameterized by an integer `N`. We export the alias `Binary =
  Multiclass{2}`.

- The function `scitype` has the fallback value `Unknown`.

- Since Tables is an optional dependency, the `scitype` of a
  Tables.jl supported table is `Unknown` unless Tables has been imported.

- Developers can define their own conventions using the code in
  "src/conventions/mlj/" as a template. The active convention is
  controlled by the value of `ScientificTypes.CONVENTION[1]`.
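
One way a switchable convention could be wired up is to dispatch on a `Val` built from a global setting. The sketch below is an assumption made for illustration, not a description of the package internals: `TOY_CONVENTION`, `toy_scitype`, the `Toy*` types and the convention names `:counts` and `:floats` are all hypothetical.

```julia
# Hypothetical stand-ins for scientific types.
abstract type ToyUnknown end
abstract type ToyCount end
abstract type ToyContinuous end

# The active convention is stored in a one-element vector so it can be changed.
const TOY_CONVENTION = [:counts]

# Public entry point: forward to the method for the active convention.
toy_scitype(x) = toy_scitype(x, Val(TOY_CONVENTION[1]))

# Fallback shared by every convention.
toy_scitype(x, ::Val) = ToyUnknown

# Two conventions that disagree about integers.
toy_scitype(::Integer, ::Val{:counts}) = ToyCount
toy_scitype(::Integer, ::Val{:floats}) = ToyContinuous

toy_scitype(3)                 # ToyCount under :counts
TOY_CONVENTION[1] = :floats    # switch convention
toy_scitype(3)                 # ToyContinuous under :floats
```

A new convention then amounts to adding methods for a new `Val` parameter, without touching existing ones.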

### Detailed usage examples

In [9]:
using ScientificTypes

Activate a convention:

In [10]:
mlj() # redundant, as the default

In [11]:
scitype(3.142)

ScientificTypes.Continuous

In [12]:
scitype((2.718, 42))

Tuple{ScientificTypes.Continuous,ScientificTypes.Count}

In [13]:
using CategoricalArrays
v = categorical(['a', 'c', 'a', missing, 'b'], ordered=true)
scitype(v[1])

ScientificTypes.OrderedFactor{3}

In [14]:
scitype(v)

AbstractArray{Union{Missing, OrderedFactor{3}},1}

In [15]:
v = [1, 2, missing, 3];
scitype(v)

AbstractArray{Union{Missing, Count},1}

In [16]:
w = coerce(Multiclass, v);
scitype(w)

│ Coerced to Union{Missing,ScientificTypes.Multiclass} instead. 
└ @ ScientificTypes /Users/anthony/Dropbox/Julia7/MLJ/ScientificTypes/src/conventions/mlj/mlj.jl:5


AbstractArray{Union{Missing, Multiclass{3}},1}

In [17]:
using Tables
T = (x1=rand(10), x2=rand(10), x3=rand(10))
scitype(T)

ScientificTypes.Table{AbstractArray{ScientificTypes.Continuous,1}}

In [18]:
using DataFrames
X = DataFrame(x1=1:5, x2=6:10, x3=11:15, x4=[16, 17, missing, 19, 20]);

In [19]:
scitype(X)

ScientificTypes.Table{Union{AbstractArray{Count,1}, AbstractArray{Union{Missing, Count},1}}}

In [20]:
schema(X)

(names = (:x1, :x2, :x3, :x4), types = (Int64, Int64, Int64, Union{Missing, Int64}), scitypes = (ScientificTypes.Count, ScientificTypes.Count, ScientificTypes.Count, Union{Missing, Count}), nrows = 5)

In [21]:
fix = Dict(:x1=>Continuous, :x2=>Continuous,
           :x3=>Multiclass, :x4=>OrderedFactor)
fixed = coerce(fix, X);
scitype(Xfixed) # NB: `Xfixed` is the quick-start table; the result of this cell's `coerce` is bound to `fixed`

│ Coerced to Union{Missing,ScientificTypes.OrderedFactor} instead. 
└ @ ScientificTypes /Users/anthony/Dropbox/Julia7/MLJ/ScientificTypes/src/conventions/mlj/mlj.jl:5


ScientificTypes.Table{Union{AbstractArray{Multiclass{4},1}, AbstractArray{Union{Missing, Continuous},1}, AbstractArray{OrderedFactor{3},1}}}

In [22]:
scitype(Xfixed) <: Table(Continuous, Finite)

false

In [23]:
scitype(Xfixed) <: Table(Continuous, Union{Finite, Missing})

false

### The scientific type of tuples, arrays and tables

Note that under any convention, the scitype of a tuple is a `Tuple`
type parameterized by scientific types:

In [24]:
scitype((1, 4.5))

Tuple{ScientificTypes.Count,ScientificTypes.Continuous}

Similarly, the scitype of an `AbstractArray` object is
`AbstractArray{U,N}`, where `U` is the union of the element scitypes
and `N` is the number of dimensions:

In [25]:
scitype([1,2,3, missing])

AbstractArray{Union{Missing, Count},1}
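
The two rules above can be sketched in a few lines of Base Julia. The names `ToyCount`, `ToyContinuous`, `elscitype` and `toy_scitype` are hypothetical and exist only for this illustration; the package's own implementation may differ.

```julia
# Hypothetical stand-ins for scientific types.
abstract type ToyCount end
abstract type ToyContinuous end

# Scitype of a single element under a toy convention.
elscitype(::Integer) = ToyCount
elscitype(::AbstractFloat) = ToyContinuous
elscitype(::Missing) = Missing

# Tuple: a Tuple type with one scitype per position.
toy_scitype(t::Tuple) = Tuple{map(elscitype, t)...}

# Array: AbstractArray{U,N}, with U the union of the element scitypes.
toy_scitype(A::AbstractArray{<:Any,N}) where {N} =
    AbstractArray{Union{map(elscitype, A)...},N}

toy_scitype((1, 4.5))              # Tuple{ToyCount, ToyContinuous}
toy_scitype([1, 2, 3, missing])    # AbstractArray{Union{Missing, ToyCount}, 1}
```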

Provided the [Tables](https://github.com/JuliaData/Tables.jl) package is loaded, any table implementing
the Tables interface has a scitype encoding the scitypes of its
columns:

In [26]:
using CategoricalArrays
using Tables
X = (x1=rand(10),
     x2=rand(10),
     x3=categorical(rand("abc", 10)),
     x4=categorical(rand("01", 10)))
scitype(X)

ScientificTypes.Table{Union{AbstractArray{Continuous,1}, AbstractArray{Multiclass{3},1}, AbstractArray{Multiclass{2},1}}}

Specifically, if `X` has columns `c1, c2, ..., cn`, then, by definition,

```julia
scitype(X) = Table{Union{scitype(c1), scitype(c2), ..., scitype(cn)}}
```

With this definition, we can perform common type checks associated
with tables. For example, to check that each column of `X` has an
element scitype subtyping either `Continuous` or `Finite` (but not
`Union{Continuous, Finite}`!), we check

```julia
scitype(X) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}
```

A built-in `Table` type constructor provides `Table(Continuous, Finite)` as
shorthand for the right-hand side. More generally,

```julia
scitype(X) <: Table(T1, T2, T3, ..., Tn)
```

if and only if `X` is a table and, for every column `col` of `X`,
`scitype(col) <: AbstractVector{<:Tj}`, for some `j` between `1` and `n`:

In [27]:
scitype(X) <: Table(Continuous, Finite)

true

Note that `Table(Continuous, Finite)` is a *type* union and not a
`Table` *instance*.
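
The following Base-only sketch shows how a constructor of this kind can return a type rather than an instance, and why the check behaves as described. `ToyTable` and the other `Toy*` types are hypothetical stand-ins, and the constructor body is an assumption about one possible implementation, not a quotation of the package source.

```julia
# Hypothetical stand-ins for scientific types.
abstract type ToyContinuous end
abstract type ToyFinite end
abstract type ToyMulticlass{N} <: ToyFinite end
abstract type ToyTable{K} end

# Calling ToyTable(T1, T2, ...) returns a *type*: the tables whose column
# scitypes each subtype AbstractVector{<:Tj} for some j.
ToyTable(Ts...) = ToyTable{<:Union{(AbstractVector{<:T} for T in Ts)...}}

# Scitype of a table with Continuous and Multiclass{3} columns.
S = ToyTable{Union{AbstractVector{ToyContinuous},
                   AbstractVector{ToyMulticlass{3}}}}

S <: ToyTable(ToyContinuous, ToyFinite)   # true
S <: ToyTable(ToyContinuous)              # false: the Multiclass column fails

# A single column mixing both kinds of element does not pass either.
M = ToyTable{AbstractVector{Union{ToyContinuous,ToyMulticlass{3}}}}
M <: ToyTable(ToyContinuous, ToyFinite)   # false
```

The leading `<:` inside the braces matters: Julia type parameters are invariant, so without it the check would demand an exact match of the column union.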

Detailed inspection of column scientific types is included in an
extended form of `Tables.schema`:

In [28]:
schema(X)

(names = (:x1, :x2, :x3, :x4), types = (Float64, Float64, CategoricalArrays.CategoricalValue{Char,UInt32}, CategoricalArrays.CategoricalValue{Char,UInt32}), scitypes = (ScientificTypes.Continuous, ScientificTypes.Continuous, ScientificTypes.Multiclass{3}, ScientificTypes.Multiclass{2}), nrows = 10)

In [29]:
schema(X).scitypes

(ScientificTypes.Continuous, ScientificTypes.Continuous, ScientificTypes.Multiclass{3}, ScientificTypes.Multiclass{2})

In [30]:
typeof(schema(X))

ScientificTypes.Schema{(:x1, :x2, :x3, :x4),Tuple{Float64,Float64,CategoricalArrays.CategoricalValue{Char,UInt32},CategoricalArrays.CategoricalValue{Char,UInt32}},Tuple{ScientificTypes.Continuous,ScientificTypes.Continuous,ScientificTypes.Multiclass{3},ScientificTypes.Multiclass{2}},10}

### The *mlj* convention

The table below summarizes the *mlj* convention for representing
scientific types:

`T`                               | `scitype(x)` for `x::T`                                                     | requires package
----------------------------------|:----------------------------------------------------------------------------|:------------------------
`Missing`                         | `Missing`                                                                   |
`AbstractFloat`                   | `Continuous`                                                                |
`Integer`                         |  `Count`                                                                    |
`CategoricalValue`                | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false`  | CategoricalArrays
`CategoricalString`               | `Multiclass{N}` where `N = nlevels(x)`, provided `x.pool.ordered == false`  | CategoricalArrays
`CategoricalValue`                | `OrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true`| CategoricalArrays
`CategoricalString`               | `OrderedFactor{N}` where `N = nlevels(x)`, provided `x.pool.ordered == true`| CategoricalArrays
`AbstractArray{<:Gray,2}`         | `GrayImage`                                                                 | ColorTypes
`AbstractArray{<:AbstractRGB,2}`  | `ColorImage`                                                                | ColorTypes
any table type `T` supported by Tables.jl | `Table{K}` where `K=Union{column_scitypes...}`                      | Tables

Here `nlevels(x) = length(levels(x.pool))`.

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*