# Statistics

In [1]:
%use kandy(0.5.0-rc-1)
%use dataframe(0.12.0)
@file:Repository("https://packages.jetbrains.team/maven/p/kds/kotlin-ds-maven")
@file:DependsOn("org.jetbrains.kotlinx:kotlin-statistics-jvm:0.0.2")



This notebook is dedicated to statistical dataset transformations (or just "statistics"/"stats").

The key feature of stats in the DSL is that they replace the dataset entirely. Instead of the original values, the new dataset holds the results of the corresponding statistical function, with one column per computed statistic. Every layer created inside the context of a statistic uses this new dataset. To refer to the new statistic columns, the context provides a `Stat` property whose members point to these columns.

As an example, consider the "bin" statistic. It bins the values and counts how many fall within each bin (it also computes the relative density of each bin). The dataset created by this statistic therefore contains three columns:
  * `Stat.x` - bin centers;
  * `Stat.count` - number of values in the bin;
  * `Stat.density` - bin density.
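
To make these three columns concrete, here is a minimal plain-Kotlin sketch of what a bin statistic might compute. It is not the library's implementation: the `Bin` class and the `binStat` function are hypothetical names, the bins are assumed to be equal-width over the data range, and density is assumed to be `count / (n * width)`, so that the bar areas sum to 1. The library may choose bin edges and normalize differently.

```kotlin
// Hypothetical container mirroring the three columns: bin center, count, density.
data class Bin(val x: Double, val count: Int, val density: Double)

// Illustrative only: splits the range [min, max] into `binCount` equal-width bins.
fun binStat(values: List<Double>, binCount: Int): List<Bin> {
    require(values.isNotEmpty()) { "values must not be empty" }
    require(binCount > 0) { "binCount must be positive" }
    val min = values.minOf { it }
    val max = values.maxOf { it }
    // Guard against a zero-width range when all values are equal.
    val width = if (max > min) (max - min) / binCount else 1.0
    val counts = IntArray(binCount)
    for (v in values) {
        // The maximum value would fall just past the last bin, so clamp the index.
        val index = ((v - min) / width).toInt().coerceIn(0, binCount - 1)
        counts[index]++
    }
    return counts.mapIndexed { i, c ->
        Bin(
            x = min + (i + 0.5) * width,          // bin center
            count = c,                            // number of values in the bin
            density = c / (values.size * width)   // assumed normalization
        )
    }
}
```

Calling `binStat(listOf(0.0, 1.0, 1.5, 2.0, 4.0), 4)` yields four bins of width 1.0, one row per bin, which is the shape of the dataset that the layers inside a statistic context work with.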

Let's generate a sample from a normal distribution, calculate these statistics for it, and build an area plot with bin centers on `x` and density on `y`.

In [2]:
import java.util.Random

val random = Random(1000)
val values = List(1000) { random.nextGaussian() }
val df = dataFrameOf(
    "sample" to values
)

In [14]:
df.plot {
    statBin(sample) {
        area {
            x(Stat.x)
            y(Stat.density)

            alpha = 0.7
            fillColor = Color.GREEN
            borderLine.type = LineType.DASHED
        }
    }
}

We have just constructed a graph that represents the distribution of a numeric variable, also known as a "density plot".

As you can see, the process is the same as for regular plots, except that the mappings use statistic columns.

## Histogram

A histogram is one of the most common and important types of graphs. In fact, a histogram is nothing more than a bar chart over binned data: bin centers on `x` and counts on `y`.

In [15]:
df.plot {
    statBin(sample) {
        bars {
            x(Stat.x)
            y(Stat.count)
        }
    }
}

Of course, there is a special shortcut for it: `histogram()`. You can check that it builds exactly the same plot as above.

In [5]:
df.plot {
    histogram(sample)
}

The context created by `histogram` combines the contexts created by `statBin` and `bars`: inside it you can specify mappings and settings for the bar aesthetics and use the pointers to statistic columns.

In [16]:
df.plot {
    histogram(sample) {
        fillColor(Stat.density) {
            scale = continuous(Color.YELLOW..Color.RED)
        }
        alpha = 0.9
        borderLine {
            width = 0.5
            color = Color.GREY
        }
    }
    layout.size = 750 to 400
}

By default, `histogram` maps bin centers to `x` and counts to `y`, but these mappings can be overridden in the layer context as usual. The most common use case is to change the `y` value of the bars from `count` to `density`.

In [17]:
df.plot {
    histogram(sample) {
        y(Stat.density)
        fillColor = Color.GREEN
    }
}

## Statistic parameters

Statistics have different parameters. For example, `statBin` (and therefore `histogram`) has two. The first, `binsOption`, controls how the binning is performed: by a given number of bins or by bin width. The second, `binsAlign`, controls how the bins are aligned.

In [19]:
df.plot {
    histogram(sample, binsOption = BinsOption.byWidth(0.5), binsAlign = BinsAlign.center(0.0))
}
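
As a rough illustration of what binning by width with center alignment means, the sketch below finds the center of the bin that a value falls into. It is an assumption about the semantics, not the library's code: `alignedBinCenter` is a hypothetical helper, one bin is assumed to be centered exactly on the alignment position, and the lower edge of each bin is assumed to be inclusive.

```kotlin
import kotlin.math.floor

// Hypothetical helper: returns the center of the bin containing `value`
// when all bins have the same `width` and one bin is centered on `alignCenter`.
fun alignedBinCenter(value: Double, width: Double, alignCenter: Double): Double {
    require(width > 0.0) { "Bin width must be positive" }
    // Lower edge of the bin that is centered on `alignCenter`.
    val lowerEdge = alignCenter - width / 2
    // Number of whole bins between that reference bin and the value's bin.
    val index = floor((value - lowerEdge) / width)
    return alignCenter + index * width
}
```

With a width of 0.5 and alignment at 0.0, as in the cell above, values in [-0.25, 0.25) would share the bin centered on 0.0, and the neighbouring bins would be centered on -0.5 and 0.5.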

## Raw source statistics

Statistics can also take raw sources (`Iterable`, etc.) as arguments:

In [20]:
plot {
    histogram(values)
}

## Statistics x Grouping

When a statistic is applied to a grouped dataset, it is calculated within each group. The dataset remains grouped by the same keys, but each group now holds the computed statistics instead of the original data. This means you can still use the grouping keys in mappings.

In [10]:
// nextGaussian(mean, stddev) is available on java.util.Random starting with Java 17
val valuesAB = List(1000) { random.nextGaussian(3.0, 5.0) } + List(1000) { random.nextGaussian(-7.0, 10.0) }

In [11]:
val dfAB = dataFrameOf(
    "values" to valuesAB,
    "type" to List(1000) {"A"} + List(1000) {"B"}
)

In [22]:
dfAB.groupBy { type }.plot {
    statBin(values) {
        area {
            x(Stat.x)
            y(Stat.density)
            fillColor(key.type)
            alpha = 0.7
        }
    }
}

Same with the histogram:

In [23]:
dfAB.groupBy { type }.plot {
    histogram(values) {
        fillColor(key.type) {
            scale = categorical("A" to Color.RED, "B" to Color.BLUE)
        }
        position = Position.dodge()
    }
}
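
The idea of computing a statistic per group can be sketched in plain Kotlin, without the plotting library. This is only an illustration: `Observation` and `countsByGroup` are hypothetical names, and the bins are assumed to have a fixed width with an edge at 0, which is not necessarily how the library picks bins for grouped data.

```kotlin
import kotlin.math.floor

// Hypothetical row type: one observation together with its grouping key.
data class Observation(val type: String, val value: Double)

// Illustrative only: counts values per fixed-width bin, separately for each group.
// The result stays keyed by group; each group maps a bin index to its count.
fun countsByGroup(rows: List<Observation>, width: Double): Map<String, Map<Int, Int>> {
    require(width > 0.0) { "Bin width must be positive" }
    return rows
        .groupBy { it.type }
        .mapValues { (_, group) ->
            group
                // Bin index i covers the interval [i * width, (i + 1) * width).
                .groupingBy { floor(it.value / width).toInt() }
                .eachCount()
                .toSortedMap()
        }
}
```

The result keeps the original grouping keys (`"A"`, `"B"`, and so on), while the contents of each group are replaced by bin counts. This mirrors why `key.type` remains available for mappings after the statistic is applied.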