# Statistics in Kandy

In [1]:
%use dataframe(0.11.1)
%use kandy(0.4.5-dev-27)

@file:Repository("https://packages.jetbrains.team/maven/p/kds/kotlin-ds-maven")
@file:DependsOn("org.jetbrains.kotlinx:kotlin-statistics-jvm:0.0.0-dev-3")



`kandy-statistics` lets you build statistical plots, i.e., plots based on statistical transformations of data. With them, you can explore your data more thoroughly as well as visualize important statistical observations.

## How do statistics work?

The workflow of a statistical transformation is simple. You start with a dataset, which can be a single `List` or a whole `DataFrame`. A statistic consumes one or more sets of values (`List`, `DataColumn`) from this dataset and produces a new dataset with the transformed data. This new dataset is then used for visualization. Kandy has an API for working with this dataset explicitly, as well as simplified APIs for quick plotting.

### `statBin` anatomy example

Let's look at an example. The `bin` statistic is one of the most commonly used: it splits observations into bins and counts the number of observations in each one. It is used to construct one of the most common statistical plots, the histogram. But before we build a histogram, let's examine the statistic itself.

In [2]:
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.distribution.UniformRealDistribution

// generate sample from normal distribution
val sample = NormalDistribution().sample(1000).toList()
// generate weights from uniform distribution
val weights = UniformRealDistribution(0.0, 1.0).sample(1000).toList()

Each statistic has several types of arguments:
1) Main inputs: one or more sets of values (usually named `x`, `y`, `z`) on which the statistic is computed. These are the only mandatory arguments. All inputs must be of the same size.
2) Weights: some statistics are weighted, i.e., the weight of each element is taken into account. Weights are passed through the optional `weights` argument. This set must have the same size as the main inputs.
3) Statistic parameters: each statistic has its own parameters, which directly affect its calculation. All of them have default values.

Let's go through this checklist of arguments for `statBin`:
1) `statBin` consumes exactly one set of values: the sample to bin (`x`).
2) It's weighted. In addition to `count` (i.e., the number of observations within a bin), `statBin` computes the `countWeighted` statistic, i.e., the weighted count: the sum of the weights of the observations within a bin. To calculate it, pass a `weights` set of the same size as the sample.
3) `statBin` has two parameters, both of which configure the bins:
   * `binsOption` allows you to specify either the number of bins or their width.
   * `binsAlign` sets the alignment of the bins.
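
To make the roles of the inputs, the weights and a bin-count parameter concrete, here is a minimal plain-Kotlin sketch of what a bin statistic computes. It is not Kandy's implementation: `BinRow` and `binSketch` are hypothetical names, the bins are assumed to span the sample range with equal width, and bin alignment is not modeled.

```kotlin
import kotlin.math.floor

// Hypothetical row type for this sketch; not a Kandy class.
data class BinRow(val center: Double, val count: Int, val countWeighted: Double)

// Splits the range of `x` into `bins` equal-width bins and counts observations per bin.
// When `weights` is null, every observation gets weight 1.0.
fun binSketch(x: List<Double>, weights: List<Double>? = null, bins: Int = 4): List<BinRow> {
    require(x.isNotEmpty()) { "sample must not be empty" }
    require(bins > 0) { "number of bins must be positive" }
    require(weights == null || weights.size == x.size) { "weights must match the sample size" }
    val min = x.minOf { it }
    val max = x.maxOf { it }
    // Fall back to width 1.0 when all values are equal.
    val width = if (max > min) (max - min) / bins else 1.0
    val counts = IntArray(bins)
    val weighted = DoubleArray(bins)
    for ((i, value) in x.withIndex()) {
        // The maximum value lies on the right edge, so clamp it into the last bin.
        val index = floor((value - min) / width).toInt().coerceIn(0, bins - 1)
        counts[index] += 1
        weighted[index] += weights?.get(i) ?: 1.0
    }
    return List(bins) { b -> BinRow(min + (b + 0.5) * width, counts[b], weighted[b]) }
}

fun main() {
    val rows = binSketch(
        x = listOf(0.0, 1.0, 1.5, 4.0),
        weights = listOf(0.5, 1.0, 1.0, 2.0),
        bins = 4,
    )
    rows.forEach(::println)
}
```

Each returned row plays the role of one bin: its center, its plain count and the sum of the weights that fell into it.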

Let's use `statBin` on our sample...

In [3]:
val statBinData = statBin(
    sample, // pass a sample as an input
    null, // don't provide weights
    BinsOption.byNumber(20), // set the number of bins
    BinsAlign.center(0.0), // set the alignment of bins
)

...and take a look at the output dataset:

In [4]:
statBinData

As you can see, we got a `DataFrame` with one `ColumnGroup` called `Stat`, which contains several columns with statistics. For `statBin`, each row corresponds to one bin. `Stat.x` is the column with the bin centers; `Stat.count` contains the number of observations in each bin. `Stat.countWeighted` is the weighted version of `count` (since we did not pass weights, this column differs from the previous one only in type, `Double` instead of `Int`; the values are the same). There are also `Stat.density` and `Stat.densityWeighted`. They contain the empirically estimated density (unweighted and weighted, respectively) of the sample at the points corresponding to the bin centers.
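
As a rough illustration of how a density column relates to a count column, the sketch below uses the usual histogram density estimate, count / (total count × bin width). This is an assumption about the definition, not a statement of how Kandy computes `Stat.density`, and `histogramDensity` is a hypothetical helper.

```kotlin
// Usual histogram density estimate: count / (total count * bin width).
// Passing weighted counts gives the weighted variant in the same way.
fun histogramDensity(counts: List<Double>, binWidth: Double): List<Double> {
    require(binWidth > 0.0) { "bin width must be positive" }
    val total = counts.sum()
    return if (total == 0.0) counts.map { 0.0 } else counts.map { it / (total * binWidth) }
}

fun main() {
    val binWidth = 0.5
    val counts = listOf(1.0, 2.0, 0.0, 1.0)
    val density = histogramDensity(counts, binWidth)
    println(density) // [0.5, 1.0, 0.0, 0.5]
    // With this definition the bars cover a total area of 1.
    println(density.sumOf { it * binWidth }) // 1.0
}
```

With this normalization, histograms of samples with different sizes or bin widths stay comparable on the `y` axis.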

### Awesome! But what about plotting?

As mentioned earlier, `statBin` is used to plot a histogram. Now that we have our new dataset, building one is easy: a classic histogram needs bars with `x` set to the bin center (`Stat.x`) and `y` set to the bin count (`Stat.count`):

In [5]:
statBinData.plot {
    bars {
        x(Stat.x)
        y(Stat.count)
    }
    layout.title = "Our awesome histogram!"
}

Of course, we don't need to explicitly calculate a new dataset every time, nor define the histogram manually each time. There are different types of APIs for this purpose, described in the next section.

## Statistics APIs

### Stat-transform API

The "stat-transform" API allows you to transform a dataset right inside `PlotContext`, calculating stats on the fly. It is essentially a set of extensions for `PlotContext` that take the usual statistic arguments (input samples, weights and parameters) but also open a new context. New layers can be created in this context as usual, but within it they use a new dataset: the one produced by the statistical transformation.

In [6]:
val df = dataFrameOf("sample" to sample, "weigths" to weights)

In [7]:
df.plot {
    statBin(sample, weigths, binsOption = BinsOption.byWidth(0.25)) {
        // new "StatBin" dataset inside this context
        line {
            // the original dataset is not available here, so we use the `Stat.` columns of the new one
            x(Stat.x)
            y(Stat.density)
        }
    }
    // the dataset hasn't changed here, so we can use it in the usual way
    vLine {
        xIntercept.constant(sample.mean())
        width = 3.0
        color = Color.RED
    }
}

### Stat-layers API

The "stat-layers" API is a set of shortcuts for the most popular statistical plots (such as a histogram). It integrates the "stat-transform" API with regular layers: a single function plots a statistical layer, combining three things at once (stat calculation, layer creation and default mappings).

In [8]:
plot {
    // equivalent to `statBin` + `bars` + x/y mappings to Stat.x/Stat.count
    histogram(sample)
}

The result is the same, but with three times less code! That doesn't mean we lose flexibility. First, `.histogram()` takes the same arguments as `.statBin()`, so we keep full control over how the statistic is computed. Second, it optionally opens a new context, a union of the `bars` and `statBin` contexts, which lets you customize `bars` (including overriding the default mappings!).

In [9]:
plot {
    histogram(sample, weights, binsAlign = BinsAlign.center(0.0)) {
        // This context combines the `bars` and `statBin` contexts; that means we can
        // define `bars` mappings and use `Stat.` columns.
        // By default `Stat.count` is mapped to `y` if weights are not provided;
        // however, we can easily override the `y` mapping, for example with `Stat.density`
        y(Stat.density)
        fillColor(Stat.density) {
            scale = continuous(Color.GREEN..Color.RED)
        }
    }
    x.limits = -3.5..3.5
}

### Stat-plots API

The "stat-plots" API allows you to build a histogram even faster, with just one function! Usually it is a function or a set of extensions for `DataFrame` that takes the standard statistic arguments (inputs, weights, parameters).

In [10]:
histogram(sample)

or

In [11]:
df.histogram("sample", binsOption = BinsOption.byNumber(10))

The column selection DSL for stat plots is slightly different from the standard one. You still open a new scope in which you can access the columns of the dataframe. However, unlike in the standard DSL, you don't return the columns as the result of the expression; instead, you pass each statistic input through the function of the same name. Weights are provided in the same way.

In [12]:
df.histogram() {
    x(sample)
    weight(weigths)
}

And of course, stat plots can be configured. We can configure layer mappings and settings exactly as in a stat layer, and also change the general settings of the plot. The `.configure()` extension is used for this purpose: it opens a context that combines several contexts you are already familiar with (stat context, layer context and plot context):

In [13]:
df.histogram(BinsOption.byNumber(14), BinsAlign.boundary(0.0)) {
    x(sample)
}.configure {
    // StatBin + Bars + Plot contexts
    x.limits = -3.5 .. 3.5
    y(Stat.density)
    borderLine.color = Color.BLACK
    layout.title = "Configured histogram"
}