# "Bin" statistics & histogram

The "bin" statistic is computed on a sample of a single continuous variable by dividing the range of values into bins and counting the number of observations in each bin. Weights are supported: for each bin, a weighted count is also computed as the sum of the weights of the observations that fall into that bin.

## Arguments

* Input:
  - `x` - numeric sample; 
* Weights:
  - `weights` - set of weights of the same size as the input sample; `null` (by default) means all weights are equal to 1.0
* Parameters: 
  - `binsOption: BinsOption` - specifies either the number of bins or their width:
    - `BinsOption.byNumber(n: Int)` - values are divided into `n` bins (the bin width is derived);
    - `BinsOption.byWidth(width: Double)` - values are divided into bins of width `width` (the number of bins is derived);
  - `binsAlign: BinsAlign` - specifies bin alignment:
    - `BinsAlign.center(pos: Double)` - bins are aligned so that a bin is centered at `pos`;
    - `BinsAlign.boundary(pos: Double)` - bins are aligned so that a boundary between two bins lies at `pos`;
    - `BinsAlign.none()` - no alignment.
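
To make the relationship between the options above concrete, here is a minimal sketch in plain Kotlin of how a bin layout could be derived. It is not the library's implementation: `BinLayout`, `layoutByNumber` and `layoutByWidth` are hypothetical names introduced only for illustration, and the rounding rules are assumptions.

```kotlin
import kotlin.math.ceil
import kotlin.math.floor

// Hypothetical helper type for illustration; not part of the library.
data class BinLayout(val start: Double, val width: Double, val count: Int)

// "By number": the bin width is derived from the data range and the requested number of bins.
fun layoutByNumber(min: Double, max: Double, n: Int): BinLayout {
    require(n > 0 && max > min)
    return BinLayout(start = min, width = (max - min) / n, count = n)
}

// "By width" with boundary alignment: one of the bin boundaries lies exactly at `boundaryPos`,
// and the number of bins is derived so that [min, max] is covered.
fun layoutByWidth(min: Double, max: Double, width: Double, boundaryPos: Double): BinLayout {
    require(width > 0.0 && max >= min)
    // Move the start down to the nearest point of the grid boundaryPos + k * width.
    val start = boundaryPos + floor((min - boundaryPos) / width) * width
    val count = maxOf(1, ceil((max - start) / width).toInt())
    return BinLayout(start, width, count)
}
```

Under these assumptions, `layoutByWidth(3.0, 47.0, 10.0, 5.0)` starts the first bin at `-5.0` and needs 6 bins to cover the data. Centering a bin at `pos` can be expressed in the same way, as a boundary at `pos - width / 2`.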

## Output statistics

| name                 | type       | description                                              |
|----------------------|------------|----------------------------------------------------------|
| Stat.x               | Double[^1] | Center of the bin                                        |
| Stat.count           | Int        | Number of observations in this bin                       |
| Stat.countWeighted   | Double     | Weighted count (sum of observation weights in this bin)  |
| Stat.density         | Double     | Empirically estimated density in this bin                |
| Stat.densityWeighted | Double     | Weighted density                                         |

[^1]: TODO: will be changed to the generic type `T` of the sample in future versions.
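
As a rough illustration of what the columns in the table above mean, the following plain-Kotlin sketch computes them for a fixed grid of bins. It is not the library's code: `BinStat` and `binStats` are hypothetical names, and the density formulas (count divided by total count times bin width, and the analogous weighted version) are the conventional definitions, assumed here rather than taken from the source.

```kotlin
// Hypothetical row type mirroring the columns of the table above.
data class BinStat(
    val x: Double,
    val count: Int,
    val countWeighted: Double,
    val density: Double,
    val densityWeighted: Double,
)

fun binStats(
    sample: List<Double>,
    weights: List<Double>?,
    start: Double,
    width: Double,
    bins: Int,
): List<BinStat> {
    require(sample.isNotEmpty() && bins > 0 && width > 0.0)
    require(weights == null || weights.size == sample.size)
    val counts = IntArray(bins)
    val weighted = DoubleArray(bins)
    for ((i, value) in sample.withIndex()) {
        // Values outside the grid are clamped to the first or last bin.
        val index = ((value - start) / width).toInt().coerceIn(0, bins - 1)
        counts[index] += 1
        // Assumption: a missing weights list means every weight is 1.0.
        weighted[index] += weights?.get(i) ?: 1.0
    }
    val total = sample.size.toDouble()
    val totalWeight = weighted.sum()
    return List(bins) { b ->
        BinStat(
            x = start + (b + 0.5) * width, // center of the bin
            count = counts[b],
            countWeighted = weighted[b],
            density = counts[b] / (total * width),
            densityWeighted = weighted[b] / (totalWeight * width),
        )
    }
}
```

With these definitions, the densities multiplied by the bin width sum to 1 over all bins, which is what makes them comparable across samples of different sizes.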

## StatBin plots

In [1]:
%use dataframe(0.11.1)
%use kandy(0.4.5-dev-27)

@file:Repository("https://packages.jetbrains.team/maven/p/kds/kotlin-ds-maven")
@file:DependsOn("org.jetbrains.kotlinx:kotlin-statistics-jvm:0.0.0-dev-3")



In [2]:
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.distribution.UniformRealDistribution

// generate sample from normal distribution
val depthList = NormalDistribution(500.0, 100.0).sample(1000).toList()
// generate sample from uniform distribution
val coeffList = UniformRealDistribution(0.0, 1.0).sample(1000).toList()

In [3]:
// gather them in the DataFrame
val df = dataFrameOf(
    "depth" to depthList,
    "coeff" to coeffList
)

Let's take a look at the `StatBin` output DataFrame:

In [4]:
df.statBin("depth", "coeff", binsOption = BinsOption.byNumber(10))

### `statBin` transform

`statBin(/* stat bin arguments */) { /* new plotting context */ }` modifies the plotting context: inside the new context, a new `StatBin` dataset, calculated from the given arguments, is used instead of the original data (whether or not it was empty). Inputs and weights can be provided as an `Iterable` or as a reference to a dataset column: by name as a `String`, as a `ColumnReference`, or as a `DataColumn`. The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the `statBin` context. Since the old dataset is irrelevant inside the new context, we cannot refer to its columns there, but we can refer to the new ones. They are all contained in the `Stat` group and can be used inside the new context:

In [5]:
plot {
    statBin(depthList, binsAlign = BinsAlign.center(500.0)) {
        // new `StatBin` dataset here
        area {
            x(Stat.x)
            y(Stat.count)
            fillColor = Color.RED
            alpha = 0.5
        }
    }
} 

### Histogram

A histogram is a statistical plot that visualizes an approximate representation of the distribution of a numerical variable. It is a bar plot in which each bar represents a bin: its x coordinate corresponds to the bin range and its y coordinate to the count. So we can build a histogram with `statBin` as follows:

In [6]:
df.plot {
    statBin("depth") {
        bars {
            x(Stat.x)
            y(Stat.count)
        }
    }
} 

But we can do it even faster with the `histogram(/* stat bin arguments */)` method:

In [7]:
plot {
    histogram(depthList)
}

The plots are identical. Indeed, `histogram` just uses `statBin` and `bars` and performs the coordinate mappings under the hood. And of course, we can customize the histogram layer: `histogram()` optionally opens a new context where we can configure the bars (as in the usual context opened by `bars { ... }`) and even change the default coordinate mappings. The `StatBin` dataset of the histogram can also be accessed here.

In [8]:
df.plot {
    histogram(depth, binsAlign = BinsAlign.center(500.0)) {
        // change a column mapped on `y` to `Stat.density`
        y(Stat.density)
        fillColor(Stat.density) {
            scale = continuous(Color.YELLOW..Color.RED)
        }
        borderLine.color = Color.BLACK
    }
}

If we specify weights, `Stat.countWeighted` is mapped to `y` by default:

In [9]:
import org.jetbrains.kotlinx.statistics.dataframe.stat.mean

df.plot {
    val mean = depth.mean()
    histogram(depth, coeff, binsOption = BinsOption.byNumber(10), binsAlign = BinsAlign.boundary(mean))
    // we can add other layers as well
    vLine {
        xIntercept.constant(mean)
        tooltips() { line("Depth mean: ${String.format("%.2f", mean)}m") }
        color = Color.RED; width = 3.0
    }
    x.axis.name = "depth, m"
}

### `histogram` plot

`histogram(/* stat bin arguments */)` and `DataFrame.histogram(/* stat bin arguments */)` form a family of functions for quickly plotting a histogram.

In [10]:
histogram(depthList, binsAlign = BinsAlign.center(500.0))

In [11]:
df.histogram("depth")

If you want to provide inputs and weights using the column selection DSL, it works a bit differently from the usual one: you assign the `x` input and (optionally) `weight` by invoking the eponymous functions:

In [12]:
df.histogram(binsOption = BinsOption.byNumber(10)) {
    x(depth)
    weight(coeff)
}

The histogram plot can be configured with the `.configure {}` extension. It opens a context that combines the bars, `StatBin` and plot contexts, which means you can configure bar settings, mappings that use the `StatBin` dataset, and any plot adjustments:

In [13]:
df.histogram(binsOption = BinsOption.byNumber(10)) {
    x(depth)
}.configure {
    // Bars + StatBin + PlotContext
    // can't add new layer
    x.limits = 0..1000
    fillColor(Stat.count) { scale = continuous(Color.GREEN..Color.RED)}
    layout.size = 1000 to 500
}

## Grouped `statBin`

`statBin` can be applied to grouped data: the statistics are computed for each group independently, but with the same bins for all groups. Applied this way, it returns a new `GroupBy` dataset with the same keys as the old one, but with `StatBin` groups instead of the original ones.
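
The idea of "independent statistics on shared bins" can be sketched in plain Kotlin as follows. This is only an illustration under assumptions, not the library's implementation: `groupedCounts` is a hypothetical helper, and taking the common range over all groups is one plausible way to obtain equal bins.

```kotlin
// Counts per group on one shared grid of bins, so bin centers match across groups.
fun groupedCounts(groups: Map<String, List<Double>>, bins: Int): Map<String, List<Int>> {
    val all = groups.values.flatten()
    require(all.isNotEmpty() && bins > 0)
    // Assumption: the grid spans the range of the pooled data from all groups.
    val min = all.minOf { it }
    val max = all.maxOf { it }
    val width = if (max > min) (max - min) / bins else 1.0
    return groups.mapValues { (_, sample) ->
        val counts = IntArray(bins)
        for (value in sample) {
            // The maximum value falls on the upper edge and is put into the last bin.
            val index = ((value - min) / width).toInt().coerceIn(0, bins - 1)
            counts[index] += 1
        }
        counts.toList()
    }
}
```

Because every group is counted on the same grid, the per-group results can be drawn side by side or stacked without any realignment.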

In [14]:
// create two samples from normal distribution with different mean/std
val rangesA = NormalDistribution(500.0, 100.0).sample(5000).toList()
val rangesB = NormalDistribution(400.0, 80.0).sample(5000).toList()

In [15]:
// gather them into `DataFrame` with "A" and "B" keys in "category" column
val rangesDF = dataFrameOf(
    "range" to rangesA + rangesB,
    "category" to List(5000) {"A"} + List(5000) {"B"}
)

In [16]:
// group it by "category"
val groupedRangesDF = rangesDF.groupBy {category}

In [17]:
groupedRangesDF.statBin { x(range) }

As you can see, the `statBin` transformation was applied within the groups, and the grouping keys did not change. Also, the bin centers match across all groups, which helps when building a grouped histogram.

The plotting process doesn't change much - we do everything the same. 

In [18]:
groupedRangesDF.plot {
    statBin(range) {
        area {
            x(Stat.x)
            y(Stat.density)
        }
    }
}

As you can see, we have built two areas because we have two groups of data. To distinguish them, we need to add a mapping from the key column. This is easy, since `key` is available in the context:

In [19]:
groupedRangesDF.plot {
    statBin(range) {
        area {
            x(Stat.x)
            y(Stat.density)
            fillColor(key.category)
            alpha = 0.6
        }
    }
}

The `histogram` layer also works. Moreover, if we have exactly one grouping key, a mapping from it to `fillColor` will be created by default.

In [20]:
groupedRangesDF.plot {
    histogram(range)
}

We can customize it as before. The differences are that we have access to the `key` columns and can customize the `position` of the bars, for example, stack them:

In [21]:
groupedRangesDF.plot {
    histogram(column<Double>("range")) {
        fillColor(key.category) {
            scale = categorical(listOf(Color.GREEN, Color.ORANGE))
        }
        borderLine.width = 0.0
        position = Position.stack()
        width = 1.0
    }
}

The histogram plot for `GroupBy` (i.e. the `GroupBy.histogram()` extensions) works as well:

In [22]:
groupedRangesDF.histogram("range")

... and can be configured the same way:

In [23]:
groupedRangesDF.histogram(binsAlign = BinsAlign.center(500.0)) { x(range) }.configure {
    alpha = 0.6
    position = Position.identity()
    fillColor("category") { scale = categoricalColorBrewer(BrewerPalette.Qualitative.Dark2) }
}

### Inside `groupBy{}` plot context

In [24]:
rangesDF.plot {
    groupBy(category) {
        histogram(range)
    }
}