# "Count2D" statistics & heatmap

The "count2d" statistic is calculated on a sample of two categorical variables (usually provided as two single-variable samples, `x` and `y`). It counts the number of observations for each pair of `x`-category and `y`-category.
It is also weighted: for each pair, a weighted count is calculated as the sum of the weights of the observations in that pair.
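
To make the definition concrete, here is a minimal plain-Kotlin sketch of what the statistic computes. It is not the library's implementation: `Count2DRow` and `count2D` are hypothetical names used only for illustration, and the sample values are made up.

```kotlin
// Hypothetical row type mirroring the four output statistics described in this notebook.
data class Count2DRow<X, Y>(val x: X, val y: Y, val count: Int, val countWeighted: Double)

// Illustrative helper (not part of any library): counts observations per (x, y) pair.
// `weights == null` means every observation has weight 1.0.
fun <X, Y> count2D(xs: List<X>, ys: List<Y>, weights: List<Double>? = null): List<Count2DRow<X, Y>> {
    require(xs.size == ys.size) { "x and y must have the same size" }
    require(weights == null || weights.size == xs.size) { "weights must match the sample size" }
    // LinkedHashMap keeps the pairs in order of first appearance.
    val acc = LinkedHashMap<Pair<X, Y>, Pair<Int, Double>>()
    for (i in xs.indices) {
        val key = xs[i] to ys[i]
        val w = weights?.get(i) ?: 1.0
        val (n, sum) = acc[key] ?: (0 to 0.0)
        acc[key] = (n + 1) to (sum + w)
    }
    return acc.map { (key, value) -> Count2DRow(key.first, key.second, value.first, value.second) }
}

// Made-up sample: four observations, three distinct (x, y) pairs.
val sampleRows = count2D(
    xs = listOf("suv", "suv", "compact", "suv"),
    ys = listOf("4", "4", "f", "r"),
    weights = listOf(20.0, 18.0, 29.0, 17.0)
)
```

With `weights` omitted, `countWeighted` simply equals `count` converted to `Double`.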

This notebook uses definitions from [DataFrame](https://kotlin.github.io/dataframe/overview.html).

## Usage

"Count2D" plots give a visual representation of the joint distribution of a discrete two-variable sample.
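
A heatmap of such a sample can be thought of as the pair counts laid out on a grid, with one axis per variable. The plain-Kotlin sketch below builds that grid; `countGrid` is a hypothetical helper, not a kandy function, and filling pairs that never occur with zero is a choice made for this illustration.

```kotlin
// Hypothetical helper: lays pair counts out as a dense grid,
// rows = y-categories, columns = x-categories (both in order of first appearance).
fun <X, Y> countGrid(xs: List<X>, ys: List<Y>): Triple<List<X>, List<Y>, List<List<Int>>> {
    require(xs.size == ys.size) { "x and y must have the same size" }
    val xCategories = xs.distinct()
    val yCategories = ys.distinct()
    // Number of observations for each (x, y) pair that actually occurs.
    val pairCounts = xs.zip(ys).groupingBy { it }.eachCount()
    // One row per y-category; pairs that never occur get 0 (illustrative choice).
    val grid = yCategories.map { y -> xCategories.map { x -> pairCounts[x to y] ?: 0 } }
    return Triple(xCategories, yCategories, grid)
}

// Made-up sample with two x-categories and three y-categories.
val gridResult = countGrid(
    xs = listOf("suv", "suv", "compact", "suv"),
    ys = listOf("4", "4", "f", "r")
)
```

Each cell of `gridResult.third` is the value a heatmap tile would encode as color.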

## Arguments

* Input (mandatory):
  - `x` - `x`-part of input sample;
  - `y` - `y`-part of input sample; 
* Weights (optional):
  - `weights` - set of weights of the same size as the input samples; `null` (by default) means all weights are equal to `1.0` and the weighted count is equal to the normal one;

### Generalized signature

The specific signature depends on the function, but all functions related to "count2d" statistic (which will be discussed further below - different variations of `statCount2D()`, `heatmap()`) have approximately the same signature with the arguments above:

```
statCount2DArgs := 
   x,
   y, 
   weights = null
```

The possible types of `x`, `y` and `weights` depend on where a given function is used. They can be a plain `Iterable` (`List`, `Set`, etc.), a reference to a column in a `DataFrame` (`String`, `ColumnAccessor`), or the `DataColumn` itself. Elements of `x` are of type `X` and elements of `y` are of type `Y`; both are generic type parameters.

## Output statistics

| name               | type   | description                                                   |
|--------------------|--------|---------------------------------------------------------------|
| Stat.x             | X      | `x`-category                                                  |
| Stat.y             | Y      | `y`-category                                                  |
| Stat.count         | Int    | Number of observations in this pair of categories             |
| Stat.countWeighted | Double | Weighted count (sum of observation weights in this pair)      |

## StatCount plots

In [1]:
%use dataframe(0.11.1)
%use kandy(0.4.5-dev-27)
@file:Repository("https://packages.jetbrains.team/maven/p/kds/kotlin-ds-maven")
@file:DependsOn("org.jetbrains.kotlinx:kotlin-statistics-jvm:0.0.0-dev-6")



In [2]:
// use "mpg" dataset
val mpgDF = DataFrame.readCSV("https://raw.githubusercontent.com/JetBrains/lets-plot-kotlin/master/docs/examples/data/mpg.csv")
mpgDF.head(5)

In [24]:
// we need only three columns
val df = mpgDF["class", "drv", "hwy"]
df.head(5)

It has the following signature:

<table>
  <thead>
    <tr>
      <th>class</th>
      <th>drv</th>
      <th>hwy</th>
    </tr>
  </thead>
</table>

Let's take a look at the `statCount2D` output `DataFrame`:

In [25]:
df.statCount2D("class", "drv", "hwy")

It has the following signature:

<table>
  <thead>
    <tr>
      <th alignt="left" colspan="4">Stat</th>
    </tr>
  </thead>
  <thead>
    <tr>
      <th>x</th>
      <th>y</th>
      <th>count</th>
      <th>countWeighted</th>
    </tr>
  </thead>
</table>

As you can see, we got a `DataFrame` with one `ColumnGroup` called `Stat`, which contains several columns with statistics. For `statCount2D`, each row corresponds to one pair of categories: `Stat.x` holds its `x`-category, `Stat.y` holds its `y`-category, `Stat.count` contains the number of observations in the pair, and `Stat.countWeighted` is the weighted version of `count`.
A `DataFrame` with "count2D" statistics is called a `StatCount2DFrame`.

### `statCount2D` plot transform

`statCount2D(statCount2DArgs) { /*new plotting context*/ }` modifies the plotting context. Inside the new context, a new `statCount2D` dataset, calculated from the given arguments, is used instead of the original data (whether or not it was empty). Inputs and weights can be provided as an `Iterable` or as a reference to a dataset column: by name as a `String`, as a `ColumnReference`, or as a `DataColumn`. The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the `statCount2D` context. Since the old dataset is not available inside the new context, we cannot refer to its columns there, but we can refer to the new ones. They are all contained in the `Stat` group and can be accessed inside the new context:

In [31]:
plot {
    statCount2D(df.`class`, df.drv) {
        // new `StatCount2D` dataset here
        points {
            // use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.y)
            size(Stat.count) {
                scale = continuous(10.0..30.0)
            }
            color = Color.RED
        }
    }
} 

In [32]:
%use lets-plot

In [49]:
letsPlot(df.toMap()) + geomPoint(stat=Stat.count2d()) {
    x = "class"
    y = "drv"
    size = "..count.."
} + scaleSize(10 to 30) + scaleXDiscrete(expand = listOf(0.1, 0.5))

### Countplot

A countplot is a statistical plot used for visualizing the distribution of a categorical variable. It is a bar plot in which each bar represents one category: its `x` coordinate corresponds to the category and its `y` coordinate to the category's count. So, basically, we can build one with `statCount` as follows:

In [6]:
val statCountAndBarsPlot = mpgDF.plot {
    statCount("class") {
        bars {
            x(Stat.x)
            y(Stat.count)
        }
    }
    layout.title = "`statCount()` + `bars()` layer"
}
statCountAndBarsPlot

But we can do it even faster with the `countPlot(statCountArgs)` method:

In [7]:
val countplotPlot = plot {
    countPlot(mpgDF.`class`)
    layout.title = "`countPlot()` layer"
}
countplotPlot

Let's compare them:

In [8]:
plotGrid(listOf(statCountAndBarsPlot, countplotPlot))

These two plots are identical. Indeed, `countPlot` just uses `statCount` and `bars` and performs the coordinate mappings under the hood. And of course we can customize the countplot layer: `countPlot()` optionally opens a new context where we can configure the bars (as in the usual context opened by `bars { ... }`) and even change the coordinate mappings from the default ones. The `StatCount` dataset of the countplot can also be accessed here.

In [9]:
df.plot {
    countPlot(`class`) {
        // filling color depends on `count` statistic
        fillColor(Stat.count) {
            scale = continuous(Color.GREEN..Color.RED)
        }
        borderLine.color = Color.BLACK
    }
}

If we specify weights, `Stat.countWeighted` is mapped to `y` by default:

In [10]:
import org.jetbrains.kotlinx.statistics.dataframe.stat.mean

df.plot {
    countPlot(`class`, hwy)
    // we can add other layers as well
    // let's add a horizontal markline with constant y intercept:
    hLine {
        val criticalCount = 500
        yIntercept.constant(criticalCount)
        tooltips() { line("Critical count: ${String.format("%d", criticalCount)}") }
        color = Color.RED; width = 3.0
    }
    x.axis.name = "Car class"
}


### `countPlot` plot

`countPlot(statCountArgs)` and `DataFrame.countPlot(statCountArgs)` are a family of functions for quickly plotting a countplot.

In [11]:
countPlot(listOf("A", "A", "A", "B", "B", "C", "B", "B"))

In [12]:
df.countPlot("class")

If you want to provide inputs and weights using the column selection DSL, it works a bit differently from the usual one: you should assign the `x` input and (optionally) `weight` by invoking the eponymous functions:

In [13]:
df.countPlot() {
    x(`class`)
    weight(hwy)
}


A `countPlot` plot can be configured with the `.configure {}` extension. It opens a context that combines the bars, `StatCount` and plot contexts, which means you can configure bar settings, mappings that use the `StatCount` dataset, and any plot adjustments:

In [14]:
df.countPlot {
    x(`class`)
}.configure {
    // Bars + StatCount + PlotContext
    // can't add new layer
    // can add bars mapping, including for `Stat.*` columns
    fillColor(Stat.x)
    alpha = 0.6
    // can configure general plot adjustments
    layout {
        title = "Configured `countPlot` plot"
        size = 600 to 350
    }
}

## Grouped `statCount`

`statCount` can be applied to grouped data: statistics are calculated on each group independently but with equal categories. This returns a new `GroupBy` dataset with the same keys as the old one but with `StatCount` groups instead of the old ones.
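
As a rough illustration of "independently but with equal categories", the plain-Kotlin sketch below counts one categorical variable within each group while reporting the same category set for every group. It rests on an assumption about what "equal categories" means here (categories missing from a group get a zero count), and `groupedCount` is a hypothetical helper, not the library's code.

```kotlin
// Hypothetical helper: per-group counts of a categorical variable,
// with every group reporting the same set of categories.
fun <K, C> groupedCount(keys: List<K>, categories: List<C>): Map<K, Map<C, Int>> {
    require(keys.size == categories.size) { "keys and categories must have the same size" }
    // Category set shared by all groups, in order of first appearance.
    val allCategories = categories.distinct()
    return keys.zip(categories)
        .groupBy({ it.first }, { it.second })
        .mapValues { (_, groupCategories) ->
            val counts = groupCategories.groupingBy { it }.eachCount()
            // Assumption: categories absent from this group are reported with a zero count.
            allCategories.associateWith { counts[it] ?: 0 }
        }
}

// Made-up sample: grouping key per observation and the category being counted.
val countsByDrv = groupedCount(
    keys = listOf("f", "f", "4", "4", "r"),
    categories = listOf("compact", "compact", "suv", "compact", "suv")
)
```

The outer map plays the role of the `GroupBy` keys, and each inner map stands in for one group's count statistics.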

In [15]:
// group our dataframe by `drv` column
val groupedDF = df.groupBy { drv }
groupedDF

Now we have a `GroupBy` with the following signature:

<table>
  <thead>
    <tr>
      <th>key: [drv]</th>
      <th>group: DataFrame[class|drv|hwy]</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>"f"</td>
      <td>"f"-Group</td>
    </tr>
    <tr>
      <td>"4"</td>
      <td>"4"-Group</td>
    </tr>
    <tr>
      <td>"r"</td>
      <td>"r"-Group</td>
    </tr>
  </tbody>
</table>

In [16]:
groupedDF.statCount { x(`class`) }

After applying `statCount`, it is still a `GroupBy` but with a different signature of `group`: all groups have the same signature as a usual `DataFrame` after applying `statCount` (i.e. `StatCountFrame`):

<table>
  <thead>
    <tr>
      <th>key: [drv]</th>
      <th>group: StatCountFrame</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>"f"</td>
      <td>"f"-Group</td>
    </tr>
    <tr>
      <td>"4"</td>
      <td>"4"-Group</td>
    </tr>
    <tr>
      <td>"r"</td>
      <td>"r"-Group</td>
    </tr>
  </tbody>
</table>

As you can see, the `statCount` transformation was indeed applied within groups, and the grouping keys did not change.

The plotting process doesn't change much - we do everything the same. 

In [17]:
groupedDF.plot {
    statCount(`class`) {
        bars {
            x(Stat.x)
            y(Stat.countWeighted)
        }
    }
}

As you can see, some categories contain several bars because we have three groups of data. To distinguish them, we need to add a mapping from the key to the filling color. This is easy, since the key is available in the context:

In [18]:
groupedDF.plot {
    statCount(`class`) {
        bars {
            x(Stat.x)
            y(Stat.countWeighted)
            fillColor(key.drv)
        }
    }
}

The `countPlot` layer also works. Moreover, if we have exactly one grouping key, a mapping from it to `fillColor` will be created by default.

In [19]:
groupedDF.plot {
    countPlot("class")
}

We can customize it as before. The differences are that we have access to the `key` columns and can customize the `position` of the bars (within a single x-coordinate), for example by stacking them:

In [20]:
groupedDF.plot {
    countPlot(column<String>("class")) {
        fillColor(key.drv) {
            scale = categorical(listOf(Color.GREEN, Color.ORANGE, Color.LIGHT_PURPLE))
        }
        borderLine.width = 0.0
        width = 1.0
        // adjust position of bars
        position = Position.stack()
    }
}
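
For intuition about what stacking does, here is a small plain-Kotlin sketch. It is not kandy's implementation, `StackedBar` and `stackBars` are hypothetical names, and the counts are made up. Within one x-category, each group's bar starts where the previous group's bar ended.

```kotlin
// Hypothetical type: vertical extent of one group's bar within a single x-category.
data class StackedBar(val group: String, val yMin: Int, val yMax: Int)

// Illustrative helper: stacks bars in the order the groups are given.
fun stackBars(countsByGroup: List<Pair<String, Int>>): List<StackedBar> {
    var top = 0
    return countsByGroup.map { (group, count) ->
        val bar = StackedBar(group, yMin = top, yMax = top + count)
        top += count // the next bar starts on top of this one
        bar
    }
}

// Made-up counts for one x-category, split by a grouping key.
val stackedBars = stackBars(listOf("f" to 35, "4" to 12, "r" to 0))
```

The top of the last bar equals the total count for that x-category, which is why a stacked countplot keeps the overall bar heights of the ungrouped one.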

A `countPlot` plot for `GroupBy` (i.e. the `GroupBy.countPlot(statCountArgs)` extensions) works as well:

In [21]:
groupedDF.countPlot("class")

... and can be configured the same way:

In [22]:
groupedDF.countPlot { x(`class`) }.configure {
    alpha = 0.6
    // make the bars from different groups overlap with each other
    position = Position.identity()
    // can access key column by name as `String`
    fillColor("drv") { scale = categoricalColorBrewer(BrewerPalette.Qualitative.Dark2) }
}

### Inside `groupBy{}` plot context

We can apply the `groupBy` modification to the initial dataset and build a countplot from grouped data in the same way:

In [23]:
df.plot {
    groupBy(drv) {
        countPlot(`class`)
    }
}