# "Smooth" statistics & smooth line plot

The "smooth" statistic is calculated on a sample of two continuous variables (i.e., a sample of points or a line). It fits a smoothing model to the data points and samples it to produce a smoother curve.

This notebook uses definitions from [DataFrame](https://kotlin.github.io/dataframe/overview.html).

## Usage

The "smooth" statistic is useful when the data is overplotted or noisy: it makes underlying trends and patterns easier to discern. It can also be used to draw a smoother-looking line through a small number of points.

## Arguments

* Input (mandatory):
  - `x` - numeric sample of the input points' `x` coordinates;
  - `y` - numeric sample of the input points' `y` coordinates;
* Parameters (optional):
  - `method: SmoothMethod` - smoothing model:
    - `SmoothMethod.Linear(confidenceLevel: Double)` - linear model;
    - `SmoothMethod.Polynomial(degree: Int, confidenceLevel: Double)` - polynomial model;
    - `SmoothMethod.LOESS(span: Double, loessCriticalSize: Int, samplingSeed: Long, confidenceLevel: Double)` - local polynomial regression model;
  - `smootherPointCount: Int` - number of points sampled along the smoothed curve.

### Generalized signature

The exact signature depends on the function, but all functions related to the "smooth" statistic (the variations of `statSmooth()` and `smoothLine()` discussed below) take approximately the same arguments:

```
statSmoothArgs := 
   x, 
   y,
   method: SmoothMethod = SmoothMethod.LOESS(),
   smootherPointCount: Int = 100
```

The possible types of `x` and `y` depend on where a certain function is used. They can be simply `Iterable` (`List`, `Set`, etc.) or a reference to a column in a `DataFrame` (`String`, `ColumnAccessor`) or the `DataColumn` itself.
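To make the roles of `method` and `smootherPointCount` more concrete, here is a minimal plain-Kotlin sketch of what a linear smoother conceptually does: fit a least-squares line to the sample, then evaluate it on an evenly spaced grid of `smootherPointCount` points. The names `LinePoint` and `linearSmoothSketch` are hypothetical and not part of Kandy; the library's actual implementation may differ.

```kotlin
// Hypothetical helper types, not part of Kandy: they only illustrate the idea.
data class LinePoint(val x: Double, val y: Double)

// Fits y = intercept + slope * x by ordinary least squares and samples the
// fitted line at `smootherPointCount` evenly spaced x values.
fun linearSmoothSketch(
    xs: List<Double>,
    ys: List<Double>,
    smootherPointCount: Int = 100
): List<LinePoint> {
    require(xs.size == ys.size && xs.size >= 2) { "need at least two (x, y) pairs" }
    require(smootherPointCount >= 2) { "need at least two output points" }

    val meanX = xs.average()
    val meanY = ys.average()
    // Slope = cov(x, y) / var(x); the intercept makes the line pass through the means.
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val sxy = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) }
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX

    // Evenly spaced grid over the observed x range.
    val minX = xs.minOf { it }
    val maxX = xs.maxOf { it }
    val step = (maxX - minX) / (smootherPointCount - 1)
    return List(smootherPointCount) { i ->
        val x = minX + i * step
        LinePoint(x, intercept + slope * x)
    }
}
```

For example, `linearSmoothSketch(newXs, newYs, smootherPointCount = 250)` would return 250 points along the fitted line, regardless of how many input points there were.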

## Output statistics

| name      | type   | description                                                      |
|-----------|--------|------------------------------------------------------------------|
| Stat.x    | Double | `x` coordinate                                                   |
| Stat.y    | Double | `y` coordinate                                                   |
| Stat.yMin | Double | Lower bound of the pointwise confidence interval around the mean |
| Stat.yMax | Double | Upper bound of the pointwise confidence interval around the mean |
| Stat.se   | Double | Standard error                                                   |
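As a rough illustration of how these columns relate to each other, the sketch below computes one output row for a simple linear model: the fitted value, its standard error, and a symmetric band around it. It assumes a normal-approximation critical value (1.96, roughly a 95% level) instead of an exact quantile, and the names `SmoothRowSketch` and `linearBandSketch` are hypothetical; this is not the library's own computation.

```kotlin
import kotlin.math.sqrt

// Hypothetical row type mirroring the Stat.* columns; not a Kandy class.
data class SmoothRowSketch(
    val x: Double,
    val y: Double,
    val yMin: Double,
    val yMax: Double,
    val se: Double
)

// Computes the fitted mean at x0, its standard error, and the band
// y -/+ criticalValue * se for an ordinary least-squares line.
// Assumption: criticalValue = 1.96 approximates a 95% confidence level.
fun linearBandSketch(
    xs: List<Double>,
    ys: List<Double>,
    x0: Double,
    criticalValue: Double = 1.96
): SmoothRowSketch {
    val n = xs.size
    require(n == ys.size && n >= 3) { "need at least three (x, y) pairs" }

    val meanX = xs.average()
    val meanY = ys.average()
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val sxy = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) }
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX

    // Residual standard deviation with n - 2 degrees of freedom.
    val sse = xs.indices.sumOf {
        val residual = ys[it] - (intercept + slope * xs[it])
        residual * residual
    }
    val sigma = sqrt(sse / (n - 2))

    // Standard error of the fitted mean: smallest at meanX, growing towards the edges.
    val se = sigma * sqrt(1.0 / n + (x0 - meanX) * (x0 - meanX) / sxx)
    val y = intercept + slope * x0
    return SmoothRowSketch(x0, y, y - criticalValue * se, y + criticalValue * se, se)
}
```

In this sketch the band is narrowest near the mean of `x` and widens towards the ends of the range, which is the typical shape of a ribbon drawn from `Stat.yMin` and `Stat.yMax`.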

## StatSmooth plots

In [1]:
%use dataframe(0.11.1)
%use kandy(0.4.5-dev-29)
@file:Repository("https://packages.jetbrains.team/maven/p/kds/kotlin-ds-maven")
@file:DependsOn("org.jetbrains.kotlinx:kotlin-statistics-jvm:0.0.0-dev-11")



In [2]:
// to generate the data we use a standard java math library
// https://commons.apache.org/proper/commons-math/
import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.distribution.UniformRealDistribution
import kotlin.random.Random

// generate line with formula
val xs = (-100..100).map { it / 50.0 }
val lineFormula = {x: Double -> 2.0 / (x * x + 0.5)}
// generate noises from normal distribution
val noises = NormalDistribution(0.0, 0.1).sample(xs.size).toList()
val ys = xs.zip(noises).map { lineFormula(it.first) + it.second }
// and drop 2/3 points
val (newXs, newYs) = xs.zip(ys).shuffled(Random(17)).take(xs.size * 1 / 3).sortedBy { it.first }.unzip()

In [3]:
// gather them into the DataFrame
val df = dataFrameOf(
    "speed" to newXs,
    "efficiency" to newYs
)

In [4]:
df.head(5)

`df` has the following signature:

<table>
  <thead>
    <tr>
      <th>speed</th>
      <th>efficiency</th>
    </tr>
  </thead>
</table>

Let's take a look at the `DataFrame` produced by `statSmooth`:

In [5]:
df.statSmooth("speed", "efficiency").head(5)

It has the following signature:

<table>
  <thead>
    <tr>
      <th align="left" colspan="5">Stat</th>
    </tr>
  </thead>
  <thead>
    <tr>
      <th>x</th>
      <th>y</th>
      <th>yMin</th>
      <th>yMax</th>
      <th>se</th>
    </tr>
  </thead>
</table>

As you can see, we got a `DataFrame` with one `ColumnGroup` called `Stat`, which contains several columns with statistics. For `statSmooth`, each row corresponds to one point of the line: `Stat.x` is the point's `x` coordinate, `Stat.y` is its `y` coordinate, `Stat.yMin` and `Stat.yMax` are the lower and upper bounds of the confidence interval, and `Stat.se` is the standard error.

A `DataFrame` with "smooth" statistics is called a `StatSmoothFrame`.

### `statSmooth` transform

`statSmooth(statSmoothArgs) { /*new plotting context*/ }` modifies the plotting context: inside the new context, a new `StatSmooth` dataset is used instead of the original data (whether or not it was empty). The dataset is calculated from the given arguments; inputs can be provided as an `Iterable` or as a reference to a dataset column (by name as a `String`, as a `ColumnReference`, or as a `DataColumn`). The original dataset and the primary context are not affected, so you can still add layers that use the initial dataset outside the `statSmooth` context. Since the old dataset is irrelevant inside the new context, we cannot refer to its columns there, but we can refer to the new ones. They are all contained in the `Stat` group and can be used inside the new context:

In [6]:
plot {
    statSmooth(newXs, newYs) {
        // new `StatSmooth` dataset here
        area {
            // use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.y)
        }
    }
    points {
        x(newXs)
        y(newYs)
    }
} 

In [7]:
df.plot { 
    statSmooth(speed, efficiency, method = SmoothMethod.Polynomial(2), smootherPointCount = 250) {
        ribbon { 
            x(Stat.x)
            yMin(Stat.yMin)
            yMax(Stat.yMax)
         }
    }
    // dataset is not changed here
    points {
        x(speed)
        y(efficiency)
    }
 }

### `smoothLine` layer

The `smoothLine` layer is a shortcut for quickly plotting a smoothed line:

In [8]:
val smoothLineLayerPlot = plot {
    smoothLine(newXs, newYs)
    layout.title = "`smoothLine()` layer"
}
smoothLineLayerPlot

In [9]:
// compare it with `statSmooth` + usual `line`
val statSmoothAndLinePlot = plot {
    statSmooth(newXs, newYs) {
        line {
            x(Stat.x)
            y(Stat.y)
        }
    }
    layout.title = "`statSmooth()` + non-statistical `line` layer"
} 
plotGrid(listOf(smoothLineLayerPlot, statSmoothAndLinePlot))

`smoothLine` uses `statSmooth` and `line` and performs the coordinate mappings under the hood. Of course, we can customize the `smoothLine` layer: `smoothLine()` optionally opens a new context where we can configure the line (as in the usual context opened by `line { ... }`) and even change the default coordinate mappings. The `StatSmooth` dataset of `smoothLine` can also be accessed here.

In [10]:
df.plot {
    smoothLine(speed, efficiency, SmoothMethod.LOESS(span = 0.1), speed.size()) {
        // change the column mapped to `y` to `Stat.yMax`
        y(Stat.yMax)
        color = Color.RED
        width = 4.0
    }
    points {
        x(speed)
        y(efficiency)
    }
}
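The `span` argument used above can be illustrated with a small plain-Kotlin sketch of local regression. It assumes `span` has its usual LOESS meaning (the fraction of the sample that takes part in each local fit) and uses a local linear fit with tricube weights. The function `loessSketch` is hypothetical; the library's LOESS may use a different polynomial degree, sampling, and other refinements.

```kotlin
import kotlin.math.abs
import kotlin.math.ceil
import kotlin.math.pow

// Hypothetical helper, not Kandy API: estimates the smoothed value at x0 by
// fitting a weighted least-squares line to the nearest `span` fraction of points.
fun loessSketch(xs: List<Double>, ys: List<Double>, x0: Double, span: Double = 0.5): Double {
    require(xs.size == ys.size && xs.isNotEmpty()) { "need matching, non-empty samples" }
    require(span > 0.0 && span <= 1.0) { "span is a fraction of the sample" }

    // Number of nearest neighbours that take part in the local fit.
    val k = ceil(span * xs.size).toInt().coerceIn(1, xs.size)
    val nearest = xs.indices.sortedBy { abs(xs[it] - x0) }.take(k)
    val maxDist = abs(xs[nearest.last()] - x0)

    // Tricube kernel: weight 1 at x0, falling to 0 at the farthest neighbour.
    val rawWeights = nearest.map { i ->
        if (maxDist == 0.0) 1.0 else (1.0 - (abs(xs[i] - x0) / maxDist).pow(3)).pow(3)
    }
    // If every neighbour sits at the maximum distance, fall back to equal weights.
    val weights = if (rawWeights.sum() > 0.0) rawWeights else List(k) { 1.0 }

    val wSum = weights.sum()
    val meanX = nearest.indices.sumOf { weights[it] * xs[nearest[it]] } / wSum
    val meanY = nearest.indices.sumOf { weights[it] * ys[nearest[it]] } / wSum
    val sxx = nearest.indices.sumOf { weights[it] * (xs[nearest[it]] - meanX).pow(2) }
    val sxy = nearest.indices.sumOf {
        weights[it] * (xs[nearest[it]] - meanX) * (ys[nearest[it]] - meanY)
    }

    // Degenerate neighbourhood (no spread in x): use the weighted mean of y.
    if (sxx == 0.0) return meanY
    return meanY + sxy / sxx * (x0 - meanX)
}
```

A small `span` makes each local fit depend on only a few nearby points, so the curve follows the data (and its noise) closely; a larger `span` averages over more points and gives a smoother curve.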

## Grouped `statSmooth`

`statSmooth` can be applied to grouped data: the statistics are calculated for each group independently. This returns a new `GroupBy` dataset with the same keys as the old one, but with `StatSmooth` groups instead of the original ones.

In [11]:
// generate two lines
// lines formulas
val fA = { x: Double -> 0.02 * x * x * x - 0.2 * x * x + 0.1 * x + 2.1 }
val fB = { x: Double -> -0.1 * x * x * x + 0.5 * x * x - 0.8 }
val xRange = (-500..500).map { it / 100.0 }
val noisesA = NormalDistribution(0.0, 0.05).sample(xRange.size).toList()
val noisesB = NormalDistribution(0.0, 0.2).sample(xRange.size).toList()
val valuesA = xRange.zip(noisesA).map { fA(it.first) + it.second }
val valuesB = xRange.zip(noisesB).map { fB(it.first) + it.second }

val (xsA, ysA) = xRange.zip(valuesA).shuffled(Random(17)).take(xRange.size * 1 / 3).sortedBy { it.first }.unzip()
val (xsB, ysB) = xRange.zip(valuesB).shuffled(Random(17)).take(xRange.size * 1 / 3).sortedBy { it.first }.unzip()

In [23]:
// gather them into `DataFrame` with "A" and "B" keys in "category" column
val valuesDF = dataFrameOf(
    "time" to xsA + xsB,
    "value" to ysA + ysB,
    "category" to List(xsA.size) { "A" } + List(xsB.size) { "B" }
)
valuesDF.head(5)

It has the following signature:

<table>
  <thead>
    <tr>
      <th>time</th>
      <th>value</th>
      <th>category</th>
    </tr>
  </thead>
</table>

In [24]:
// group it by "category"
val groupedDF = valuesDF.groupBy { category }
groupedDF

Now we have a `GroupBy` with the following signature:

<table>
  <thead>
    <tr>
      <th>key: [category]</th>
      <th>group: DataFrame[time|value|category]</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A</td>
      <td>A-Group</td>
    </tr>
    <tr>
      <td>B</td>
      <td>B-Group</td>
    </tr>
  </tbody>
</table>

In [25]:
groupedDF.statSmooth { x(time); y(value) }

After applying `statSmooth`, it is still a `GroupBy`, but the signature of `group` is different: every group has the same signature as an ordinary `DataFrame` after applying `statSmooth` (i.e. `StatSmoothFrame`):

<table>
  <thead>
    <tr>
      <th>key: [category]</th>
      <th>group: StatSmoothFrame</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>"A"</td>
      <td>"A"-Group</td>
    </tr>
    <tr>
      <td>"B"</td>
      <td>"B"-Group</td>
    </tr>
  </tbody>
</table>

As you can see, the `statSmooth` transformation was indeed applied within each group, and the grouping keys did not change.
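The same "per group, keys unchanged" behaviour can be sketched with plain Kotlin collections. The example below groups observations by a key and fits a least-squares line to each group independently; `Observation`, `LineFitSketch`, `fitLineSketch`, and `smoothPerGroupSketch` are hypothetical names used only for illustration and are not part of Kandy or DataFrame.

```kotlin
// Hypothetical record and result types; not part of Kandy or DataFrame.
data class Observation(val category: String, val time: Double, val value: Double)
data class LineFitSketch(val slope: Double, val intercept: Double)

// Ordinary least-squares line through the points of a single group.
fun fitLineSketch(points: List<Observation>): LineFitSketch {
    require(points.size >= 2) { "need at least two points per group" }
    val meanX = points.map { it.time }.average()
    val meanY = points.map { it.value }.average()
    val sxx = points.sumOf { (it.time - meanX) * (it.time - meanX) }
    require(sxx > 0.0) { "time values in a group must not all be equal" }
    val sxy = points.sumOf { (it.time - meanX) * (it.value - meanY) }
    val slope = sxy / sxx
    return LineFitSketch(slope, meanY - slope * meanX)
}

// Each group is fitted on its own; the map keeps the original grouping keys.
fun smoothPerGroupSketch(rows: List<Observation>): Map<String, LineFitSketch> =
    rows.groupBy { it.category }.mapValues { (_, group) -> fitLineSketch(group) }
```

The result has one entry per key, just as the `GroupBy` above has one `StatSmoothFrame` per key.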

The plotting process doesn't change much - we do everything the same. 

In [29]:
groupedDF.plot {
    statSmooth(time, value) {
        line {
            x(Stat.x)
            y(Stat.y)
        }
    }
}

As you can see, there are two lines because we have two groups of data. To distinguish them, we need to map the key to the color. This is easy, since the key is available in the context:

In [33]:
groupedDF.plot {
    statSmooth(time, value, method = SmoothMethod.Polynomial(3)) {
        line {
            x(Stat.x)
            y(Stat.y)
            color(key.category)
        }
    }
}

The `smoothLine()` layer also works. Moreover, if we have exactly one grouping key, a mapping from it to `color` will be created by default.

In [34]:
groupedDF.plot {
    smoothLine(time, value)
}

We can customize it as before. The only difference is that the `key` columns are accessible:

In [40]:
groupedDF.plot {
    smoothLine(time, value) {
        color = Color.GREEN
        type(key.category) {
            scale = categorical()
        }
    }
}

We can also stack areas (for that, the `x` coordinates need to match, so use `trim = true`):

In [19]:
groupedRangesDF.plot {
    // use trim
    densityPlot(column<Double>("range"), trim = true) {
        // adjust the position of areas from different groups
        position = Position.stack()
        alpha = 0.8
    }
}

`densityPlot` plot for `GroupBy` (i.e. `GroupBy.densityPlot(statDensityArgs)` extensions) works as well:

In [20]:
groupedRangesDF.densityPlot("range", bandWidth = BandWidth.value(10.0))

... and can be configured the same way:

In [21]:
groupedRangesDF.densityPlot(n = 750, trim = true, adjust = 0.75) { x(range) }.configure {
    alpha = 0.6
    position = Position.stack()
    // can access key column by name as `String`
    fillColor("category") { scale = categoricalColorBrewer(BrewerPalette.Qualitative.Dark2) }
}

### Inside `groupBy{}` plot context

We can apply the `groupBy` modification to the initial dataset and build a density plot with grouped data in the same way:

In [22]:
rangesDF.plot {
    groupBy(category) {
        densityPlot(range)
    }
}
