# "Bin" statistics & histogram

The "bin" statistic is computed on a sample of a single continuous variable. First, the range of values is divided into bins (sequential, non-overlapping intervals); then the number of observations falling into each bin is counted.
The statistic also supports weights: in that case a weighted count is calculated for each bin, that is, the sum of the weights of the observations in that bin.
The bin construction method (for example, by the exact number of bins or by their width) must be chosen carefully, because it strongly affects how the data is displayed and analyzed. A well-chosen binning keeps the plot easy to read and gives a faithful picture of the data.
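
The counting step described above can be sketched in plain Kotlin. This is only an illustration, not the library's implementation: `weightedBinCounts` is a hypothetical helper, and it assumes equal-width bins that span the sample range, with the maximum value placed in the last bin.

```kotlin
// Hypothetical helper for illustration; not part of any library API.
// Splits [min, max] of the sample into `bins` equal-width bins
// and sums the weights of the observations that fall into each bin.
fun weightedBinCounts(
    sample: List<Double>,
    weights: List<Double>? = null, // null: every observation has weight 1.0
    bins: Int
): List<Double> {
    require(bins > 0) { "bins must be positive" }
    require(sample.isNotEmpty()) { "sample must not be empty" }
    require(weights == null || weights.size == sample.size) {
        "weights must have the same size as the sample"
    }

    val min = sample.minOf { it }
    val max = sample.maxOf { it }
    val width = (max - min) / bins
    val counts = DoubleArray(bins)

    sample.forEachIndexed { i, value ->
        // The maximum value would fall just past the last bin, so clamp the index.
        val index = if (width == 0.0) 0 else ((value - min) / width).toInt().coerceIn(0, bins - 1)
        counts[index] += weights?.get(i) ?: 1.0
    }
    return counts.toList()
}
```

With `weights = null` the result equals the ordinary count per bin, which matches the default behaviour described for the `weights` argument.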

This notebook uses definitions from [DataFrame](https://kotlin.github.io/dataframe/overview.html).

## Usage

Binning is commonly used in statistics and data analysis to simplify complex data sets and make them easier to interpret. A histogram (or any other plot based on the "bin" statistic) gives an overview of the sample distribution.

## Arguments

* Input (mandatory):
  - `x` - numeric sample on which the statistics are calculated; 
* Weights (optional):
  - `weights` - set of weights of the same size as the input sample; `null` (by default) means all weights are equal to `1.0` and the weighted count is equal to the normal one;
* Parameters (optional): 
  - `binsOption: BinsOption` - specifies either the number of bins or their width:
    - `BinsOption.byNumber(n: Int)` - values are divided into `n` bins (the bin width is derived);
    - `BinsOption.byWidth(width: Double)` - values are divided into bins of width `width` (the number of bins is derived);
  - `binsAlign: BinsAlign` - specifies how bins are aligned:
    - `BinsAlign.center(pos: Double)` - bins are aligned so that one bin is centered at `pos`;
    - `BinsAlign.boundary(pos: Double)` - bins are aligned so that a boundary between two bins lies at `pos`;
    - `BinsAlign.none()` - no alignment.
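
To make the relationship between these options concrete, here is a rough sketch in plain Kotlin. All four functions are hypothetical and written only for illustration. The library may derive bins differently, for example by adding an extra bin when alignment shifts the edges.

```kotlin
import kotlin.math.ceil
import kotlin.math.floor

// Hypothetical helpers for illustration; not the library's algorithm.

/** Bin width when the range [min, max] is split into `n` equal bins. */
fun widthForNumber(min: Double, max: Double, n: Int): Double = (max - min) / n

/** Number of bins of the given width needed to cover [min, max]. */
fun numberForWidth(min: Double, max: Double, width: Double): Int =
    ceil((max - min) / width).toInt().coerceAtLeast(1)

/** Left edge of the first bin when a boundary between two bins must lie at `pos`. */
fun firstEdgeForBoundary(min: Double, width: Double, pos: Double): Double =
    pos + floor((min - pos) / width) * width

/** Left edge of the first bin when one bin must be centered at `pos`. */
fun firstEdgeForCenter(min: Double, width: Double, pos: Double): Double =
    // A bin centered at `pos` has its boundaries half a width away from `pos`.
    firstEdgeForBoundary(min, width, pos - width / 2)
```

In this sketch, centering at `pos` is just boundary alignment shifted by half a bin width, which is why both alignments can share one helper.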

### Generalized signature

The specific signature depends on the function, but all functions related to the "bin" statistic (the variations of `statBin()` and `histogram()` discussed below) have approximately the same signature with the arguments above:

```
statBinArgs := 
   x, 
   weights = null, 
   binsOption: BinsOption = BinsOption.byNumber(20), 
   binsAlign: BinsAlign = BinsAlign.center(0.0)
```

The possible types of `x` and `weights` depend on where a particular function is used. They can be a plain `Iterable` (`List`, `Set`, etc.), a reference to a column in a `DataFrame` (`String`, `ColumnAccessor`), or the `DataColumn` itself.

## Output statistics

| name                 | type       | description                                              |
|----------------------|------------|----------------------------------------------------------|
| Stat.x               | Double[^1] | Center of bin                                            |
| Stat.count           | Int        | Number of observations in this bin                       |
| Stat.countWeighted   | Double     | Weighted count (sum of observations weights in this bin) |
| Stat.density         | Double     | Empirically estimated density in this bin                |
| Stat.densityWeighted | Double     | Weighted density                                         |

[^1]: TODO: will be changed to the generic type `T` of the sample in future versions.
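
The text does not spell out how the density columns are computed. As a rough sketch, assuming the common histogram normalisation in which the total bar area equals 1, the density can be derived from the counts and the bin width. `binDensities` is a hypothetical helper, not a library function.

```kotlin
// Hypothetical helper for illustration.
// Assumes density is normalised so that the sum of (density * width) over all bins is 1.
// Pass ordinary counts for a plain density or weighted counts for a weighted density.
fun binDensities(counts: List<Double>, width: Double): List<Double> {
    require(width > 0.0) { "width must be positive" }
    val total = counts.sum()
    // An empty sample has no density to distribute.
    return counts.map { if (total == 0.0) 0.0 else it / (total * width) }
}
```

For example, counts of `2.0` and `6.0` with a bin width of `0.5` give densities of `0.5` and `1.5`, and `(0.5 + 1.5) * 0.5` is `1.0`.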

## StatBin plots

In [1]:
%use kandy(0.4.5-dev-32)
%use dataframe(0.12.0)
@file:Repository("https://packages.jetbrains.team/maven/p/kds/kotlin-ds-maven")
@file:DependsOn("org.jetbrains.kotlinx:kotlin-statistics-jvm:0.0.1")



In [2]:
// to generate the data we use the Apache Commons Math library
// https://commons.apache.org/proper/commons-math/
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.distribution.UniformRealDistribution

// generate sample from normal distribution
val depthList = NormalDistribution(500.0, 100.0).sample(1000).toList()
// generate sample from uniform distribution
val coeffList = UniformRealDistribution(0.0, 1.0).sample(1000).toList()

In [3]:
// gather them into the DataFrame
val df = dataFrameOf(
    "depth" to depthList,
    "coeff" to coeffList
)

`df` has the following signature:

<table>
  <thead>
    <tr>
      <th>depth</th>
      <th>coeff</th>
    </tr>
  </thead>
</table>

In [4]:
df.head(5)

Let's take a look at the `statBin` output `DataFrame`:

In [5]:
df.statBin("depth", "coeff", binsOption = BinsOption.byNumber(10))

It has the following signature:

<table>
  <thead>
    <tr>
      <th align="left" colspan="5">Stat</th>
    </tr>
  </thead>
  <thead>
    <tr>
      <th>x</th>
      <th>count</th>
      <th>countWeighted</th>
      <th>density</th>
      <th>densityWeighted</th>
    </tr>
  </thead>
</table>

As you can see, we got a `DataFrame` with one `ColumnGroup` called `Stat`, which contains several columns with statistics. For `statBin`, each row corresponds to one bin. `Stat.x` is the column with the centers of the bins; `Stat.count` contains the number of observations in each bin; `Stat.countWeighted` is the weighted version of `count`. There are also `Stat.density` and `Stat.densityWeighted`, which contain the empirically estimated density of the sample (plain and weighted, respectively) at the points corresponding to the bin centers.

A `DataFrame` with "bin" statistics is called a `StatBinFrame`.

### `statBin` context transform

`statBin(statBinArgs) { /*new plotting context*/ }` modifies the plotting context: inside the new context, a new `StatBin` dataset computed from the given arguments is used instead of the original data (whether or not it was empty). Inputs and weights can be provided as an `Iterable` or as a dataset column reference (by name as a `String`, as a `ColumnReference`, or as a `DataColumn`). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the `statBin` context. Since the old dataset is not available inside the new context, its columns cannot be referenced there, but the new ones can. They are all contained in the `Stat` group and can be accessed inside the new context:

In [6]:
plot {
    statBin(depthList, binsAlign = BinsAlign.center(500.0)) {
        // new `StatBin` dataset here
        area {
            // use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.count)
            fillColor = Color.RED
            alpha = 0.5
        }
    }
} 

### Histogram layer

A histogram is a statistical plot that visualizes an approximate representation of the distribution of a numerical variable. It is a bar plot in which each bar represents a bin: its x coordinate corresponds to the bin range and its y coordinate to the count. So we can build a histogram with `statBin` as follows:

In [7]:
val statBinBarsPlot = df.plot {
    statBin("depth") {
        bars {
            x(Stat.x)
            y(Stat.count)
        }
    }
    layout.title = "`statBin` + `bars`"
}
statBinBarsPlot

But we can do it even faster with the `histogram(statBinArgs)` method:

In [8]:
val histogramPlot = plot {
    histogram(depthList)
    layout.title = "`histogram`"
}
histogramPlot

Let's compare them:

In [9]:
plotGrid(listOf(statBinBarsPlot, histogramPlot))

These two plots are identical: `histogram` simply uses `statBin` and `bars` and performs the coordinate mappings under the hood. The histogram layer can also be customized: `histogram()` optionally opens a new context in which we can configure the bars (as in the usual context opened by `bars { ... }`) and even change the default coordinate mappings. The `StatBin` dataset of the histogram can also be accessed there.

In [10]:
df.plot {
    histogram(depth, binsAlign = BinsAlign.center(500.0)) {
        // change a column mapped on `y` to `Stat.density`
        y(Stat.density)
        // filling color depends on `density` statistic
        fillColor(Stat.density) {
            scale = continuous(Color.YELLOW..Color.RED)
        }
        borderLine.color = Color.BLACK
    }
}

If we specify weights, `Stat.countWeighted` is mapped to `y` by default:

In [11]:
import org.jetbrains.kotlinx.statistics.dataframe.stat.mean

df.plot {
    // compute the sample mean
    val mean = depth.mean()
    // add weighted histogram
    histogram(depth, coeff, binsOption = BinsOption.byNumber(10), binsAlign = BinsAlign.boundary(mean))
    // we can add other layers as well
    // let's add a vertical mark line at the sample mean
    vLine {
        xIntercept.constant(mean)
        tooltips() { line("Depth mean: ${String.format("%.2f", mean)}m") }
        color = Color.RED; width = 3.0
    }
    x.axis.name = "depth, m"
}

### `histogram` plot

`histogram(statBinArgs)` and `DataFrame.histogram(statBinArgs)` are a family of functions for quickly plotting a histogram.

In [12]:
histogram(depthList, binsAlign = BinsAlign.center(500.0))

In [13]:
df.histogram("depth")

If you want to provide inputs and weights using the column selection DSL, it works a bit differently from the usual one: you assign the `x` input and (optionally) `weight` by invoking the eponymous functions:

In [14]:
df.histogram(binsOption = BinsOption.byNumber(10)) {
    x(depth)
    weight(coeff)
}

A histogram plot can be configured with the `.configure {}` extension. It opens a context that combines the bars, `StatBin` and plot contexts, which means you can configure bar settings, create mappings from the `StatBin` dataset, and make any plot adjustments:

In [15]:
df.histogram(binsOption = BinsOption.byNumber(15)) {
    x(depth)
}.configure {
    // Bars + StatBin + PlotContext
    // can't add new layer
    x.limits = 100..900
    // can add bars mapping, incl. on `Stat.*` columns
    fillColor(Stat.count) { scale = continuous(Color.GREEN..Color.RED) }
    // can configure general plot adjustments
    layout {
        title = "Configured histogram plot"
        size = 600 to 350
    }
}

## Grouped `statBin`

`statBin` can be applied to grouped data: statistics are computed for each group independently but with the same bins. This returns a new `GroupBy` dataset with the same keys as the original one but with `StatBin` groups in place of the original groups.

In [16]:
// create two samples from normal distribution with different mean/std
val rangesA = NormalDistribution(500.0, 100.0).sample(5000).toList()
val rangesB = NormalDistribution(400.0, 80.0).sample(5000).toList()

In [17]:
// gather them into `DataFrame` with "A" and "B" keys in "category" column
val rangesDF = dataFrameOf(
    "range" to rangesA + rangesB,
    "category" to List(5000) { "A" } + List(5000) { "B" }
)
rangesDF.head(5)

It has the following signature:

<table>
  <thead>
    <tr>
      <th>range</th>
      <th>category</th>
    </tr>
  </thead>
</table>

In [18]:
// group it by "category"
val groupedRangesDF = rangesDF.groupBy { category }
groupedRangesDF

Now we have a `GroupBy` with the following signature:

<table>
  <thead>
    <tr>
      <th>key: [category]</th>
      <th>group: DataFrame[range|category]</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A</td>
      <td>A-Group</td>
    </tr>
    <tr>
      <td>B</td>
      <td>B-Group</td>
    </tr>
  </tbody>
</table>

In [19]:
groupedRangesDF.statBin { x(range) }

After applying `statBin`, the result is still a `GroupBy`, but the signature of `group` is different: every group has the same signature as an ordinary `DataFrame` after applying `statBin` (i.e. `StatBinFrame`):

<table>
  <thead>
    <tr>
      <th>key: [category]</th>
      <th>group: StatBinFrame</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A</td>
      <td>A-Group</td>
    </tr>
    <tr>
      <td>B</td>
      <td>B-Group</td>
    </tr>
  </tbody>
</table>

As you can see, the `statBin` transformation was applied within each group, and the grouping keys did not change. Also, the bin centers match across all groups, which helps to build a grouped histogram.
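
One way to obtain matching bins across groups is to derive them from the pooled data and then count each group separately. The sketch below shows that idea in plain Kotlin. `sharedBinCounts` is a hypothetical helper and only an assumption about how equal bins could be achieved, not the library's actual procedure.

```kotlin
// Hypothetical helper for illustration.
// Bins every group over the range of the pooled data,
// so that bin i covers the same interval in every group.
fun <K> sharedBinCounts(groups: Map<K, List<Double>>, bins: Int): Map<K, List<Int>> {
    require(bins > 0) { "bins must be positive" }
    val all = groups.values.flatten()
    require(all.isNotEmpty()) { "at least one observation is required" }

    // Bin edges come from all observations, regardless of group.
    val min = all.minOf { it }
    val max = all.maxOf { it }
    val width = (max - min) / bins

    return groups.mapValues { (_, sample) ->
        val counts = IntArray(bins)
        for (value in sample) {
            // Clamp so that the overall maximum lands in the last bin.
            val index = if (width == 0.0) 0 else ((value - min) / width).toInt().coerceIn(0, bins - 1)
            counts[index]++
        }
        counts.toList()
    }
}
```

Because every group is counted against the same edges, the per-group results can be drawn side by side or stacked without any re-binning.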

The plotting process doesn't change much - we do everything the same. 

In [20]:
groupedRangesDF.plot {
    statBin(range) {
        area {
            x(Stat.x)
            y(Stat.density)
        }
    }
}

As you can see, there are two areas because we have two groups of data. To distinguish them, we need to map the key to the fill color. This is easy, because the key is available in the context:

In [21]:
groupedRangesDF.plot {
    statBin(range) {
        area {
            x(Stat.x)
            y(Stat.density)
            // can access "key." columns and create mapping from them
            fillColor(key.category)
            alpha = 0.6
        }
    }
}

The `histogram` layer also works. Moreover, if we have exactly one grouping key, a mapping from it to `fillColor` will be created by default.

In [22]:
groupedRangesDF.plot {
    histogram(range)
}

We can customize it as before. The differences are that we have access to the `key` columns and that we can customize the `position` of bars (within a single x-coordinate), for example, stack them:

In [23]:
groupedRangesDF.plot {
    histogram(column<Double>("range")) {
        fillColor(key.category) {
            scale = categorical(listOf(Color.GREEN, Color.ORANGE))
        }
        borderLine.width = 0.0
        width = 1.0
        // adjust position of bars from different groups
        position = Position.stack()
    }
}
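
To illustrate what stacking means for bars that share an x-coordinate, here is a small sketch in plain Kotlin. `stackBars` is a hypothetical helper. The order in which the library stacks groups is not stated in this notebook, so the input order is used here as an assumption.

```kotlin
// Hypothetical helper for illustration.
// For each group (in input order) returns the (bottom, top) of its bar in every bin,
// where each bar starts at the top of the bars already placed in that bin.
fun stackBars(countsPerGroup: List<List<Double>>): List<List<Pair<Double, Double>>> {
    if (countsPerGroup.isEmpty()) return emptyList()
    val bins = countsPerGroup.first().size
    require(countsPerGroup.all { it.size == bins }) { "all groups must share the same bins" }

    // Running height of the stack in each bin.
    val base = DoubleArray(bins)
    return countsPerGroup.map { counts ->
        counts.mapIndexed { i, count ->
            val bottom = base[i]
            base[i] += count
            bottom to base[i]
        }
    }
}
```

Stacking only works bin by bin when all groups share the same bins, which is why the sketch checks that every group has the same number of bins.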

The histogram plot for `GroupBy` (i.e. the `GroupBy.histogram(statBinArgs)` extensions) works as well:

In [24]:
groupedRangesDF.histogram("range")

... and can be configured the same way:

In [25]:
groupedRangesDF.histogram(binsAlign = BinsAlign.center(500.0)) { x(range) }.configure {
    alpha = 0.6
    // make the bars from different groups overlap with each other
    position = Position.identity()
    // can access key column by name as `String`
    fillColor("category") { scale = categoricalColorBrewer(BrewerPalette.Qualitative.Dark2) }
}

### Inside `groupBy{}` plot context

We can apply `groupBy` modification to the initial dataset and build a histogram with grouped data the same way:

In [26]:
rangesDF.plot {
    groupBy(category) {
        histogram(range)
    }
}