# "Smooth" statistics & smooth line plot

The "smooth" statistic is calculated on a sample of two continuous variables (i.e., a sample of points or lines).
It fits a regression model to the data points and samples it to produce a smoother curve.

This notebook uses definitions from [DataFrame](https://kotlin.github.io/dataframe/overview.html).

## Usage

The "smooth" statistic is useful when the data is over-plotted or noisy,
because it makes underlying trends and patterns easier to identify.
It can also be used to draw a smoother-looking line through a small number of points.

## Arguments

* Input (mandatory):
    - `x` — numeric sample of input points `x` coordinates
    - `y` — numeric sample of input points `y` coordinates
* Parameters (optional):
    - `method: SmoothMethod` — smoothing model:
        - `SmoothMethod.Linear(confidenceLevel: Double)` — linear model
        - `SmoothMethod.Polynomial(degree: Int, confidenceLevel: Double)` — polynomial model
        - `SmoothMethod.LOESS(span: Double, loessCriticalSize: Int, samplingSeed: Long, confidenceLevel: Double)` —
          Local Polynomial Regression model
    - `smootherPointCount: Int` — number of sampled points
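
To give an intuition for the `span` parameter, here is a minimal sketch in plain Kotlin of a textbook local linear regression evaluated at a single point. It is not Kandy's implementation: the helper name `localLinearFit`, the tricube weighting, and the fallback for degenerate neighbourhoods are all assumptions made for illustration.

```kotlin
import kotlin.math.abs
import kotlin.math.pow

// Hypothetical helper (not part of Kandy): estimates y at x0 from a weighted
// straight-line fit over the nearest `span` fraction of the sample.
fun localLinearFit(x0: Double, xs: List<Double>, ys: List<Double>, span: Double): Double {
    require(xs.size == ys.size && xs.size >= 2) { "need at least two (x, y) pairs" }
    require(span > 0.0 && span <= 1.0) { "span must be in (0, 1]" }
    val n = xs.size
    // Number of nearest neighbours that take part in the local fit
    val k = (span * n).toInt().coerceIn(2, n)
    val distances = xs.map { abs(it - x0) }
    // Distance to the k-th nearest neighbour defines the local window
    val bandwidth = distances.sorted()[k - 1]
    // Tricube weights: close points count more, points outside the window count zero
    val tricube = distances.map { d ->
        if (bandwidth > 0.0 && d < bandwidth) (1.0 - (d / bandwidth).pow(3)).pow(3) else 0.0
    }
    // Fall back to equal weights inside the window if all tricube weights vanish
    val weights = if (tricube.sum() > 0.0) tricube
    else distances.map { d -> if (d <= bandwidth) 1.0 else 0.0 }

    val wSum = weights.sum()
    val meanX = xs.indices.sumOf { weights[it] * xs[it] } / wSum
    val meanY = xs.indices.sumOf { weights[it] * ys[it] } / wSum
    val sxx = xs.indices.sumOf { weights[it] * (xs[it] - meanX) * (xs[it] - meanX) }
    val sxy = xs.indices.sumOf { weights[it] * (xs[it] - meanX) * (ys[it] - meanY) }
    // With no spread in x, the weighted mean is the best available estimate
    return if (sxx < 1e-12) meanY else meanY + sxy / sxx * (x0 - meanX)
}
```

In this sketch, a small `span` lets the curve follow local structure closely, while a `span` near `1.0` uses almost the whole sample for every point and gives a stiffer curve.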

### Generalized signature

The specific signature depends on the function,
but all functions related to the "smooth" statistic (discussed further below:
different variations of `statSmooth()` and `smoothLine()`) have approximately the same signature with the arguments above:

```
statSmoothArgs := 
   x, 
   y,
   method: SmoothMethod = SmoothMethod.LOESS(),
   smootherPointCount: Int = 100
```

The possible types of `x` and `y` depend on where a certain function is used.
They can be simply `Iterable` (`List`, `Set`, etc.) or a reference to a column in a `DataFrame`
(`String`, `ColumnAccessor`) or the `DataColumn` itself.

## Output statistics

| name      | type   | description                                          |
|-----------|--------|------------------------------------------------------|
| Stat.x    | Double | `x` coordinate                                       |
| Stat.y    | Double | `y` coordinate                                       |
| Stat.yMin | Double | Lower point-wise confidence interval around the mean |
| Stat.yMax | Double | Upper point-wise confidence interval around the mean |
| Stat.se   | Double | Standard error                                       |
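
To make the meaning of these columns concrete, here is a minimal sketch in plain Kotlin of how such values could be produced for a simple linear model. It is not Kandy's implementation: `SmoothPoint`, `linearSmooth`, the evenly spaced `x` grid, and the fixed `critical` multiplier of 1.96 (a normal approximation) are hypothetical choices made for illustration.

```kotlin
import kotlin.math.sqrt

// Hypothetical container mirroring the Stat.* columns
data class SmoothPoint(val x: Double, val y: Double, val yMin: Double, val yMax: Double, val se: Double)

// Fits y = intercept + slope * x by ordinary least squares and samples the
// fitted line at `pointCount` evenly spaced x values between min(x) and max(x).
fun linearSmooth(
    xs: List<Double>,
    ys: List<Double>,
    pointCount: Int = 100,
    critical: Double = 1.96 // assumed multiplier for the confidence band
): List<SmoothPoint> {
    require(xs.size == ys.size && xs.size > 2) { "need more than two (x, y) pairs" }
    require(pointCount >= 2) { "need at least two sampled points" }
    val n = xs.size
    val meanX = xs.average()
    val meanY = ys.average()
    val sxx = xs.sumOf { (it - meanX) * (it - meanX) }
    require(sxx > 0.0) { "x values must not all be equal" }
    val sxy = xs.indices.sumOf { (xs[it] - meanX) * (ys[it] - meanY) }
    val slope = sxy / sxx
    val intercept = meanY - slope * meanX
    // Residual variance with n - 2 degrees of freedom
    val residualVariance = xs.indices.sumOf { i ->
        val r = ys[i] - (intercept + slope * xs[i])
        r * r
    } / (n - 2)
    val minX = xs.minOf { it }
    val maxX = xs.maxOf { it }
    val step = (maxX - minX) / (pointCount - 1)
    return List(pointCount) { i ->
        val x = minX + i * step
        val y = intercept + slope * x
        // Standard error of the fitted mean at x
        val se = sqrt(residualVariance * (1.0 / n + (x - meanX) * (x - meanX) / sxx))
        SmoothPoint(x, y, y - critical * se, y + critical * se, se)
    }
}
```

In this sketch the band is symmetric around the fitted value, so `yMax - y` and `y - yMin` are both equal to `critical * se`.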

## StatSmooth plots

In [1]:
%useLatestDescriptors
%use dataframe
%use kandy

import org.apache.commons.math3.distribution.NormalDistribution
import org.apache.commons.math3.distribution.UniformRealDistribution
import kotlin.random.Random

In [2]:
// To generate the data, we use the Apache Commons Math library
// https://commons.apache.org/proper/commons-math/

// Generate line with formula
val xs = (-100..100).map { it / 50.0 }
val lineFormula = { x: Double -> 2.0 / (x * x + 0.5) }
// Generate noises from normal distribution
val noises = NormalDistribution(0.0, 0.1).sample(xs.size).toList()
val ys = xs.zip(noises).map { lineFormula(it.first) + it.second }
// And drop 2/3 points
val random = Random(42)
val (newXs, newYs) = xs.zip(ys).shuffled(random).take(xs.size * 1 / 3).sortedBy { it.first }.unzip()

In [3]:
// Gather them into the DataFrame
val df = dataFrameOf(
    "speed" to newXs,
    "efficiency" to newYs
)
df.head(5)

| speed     | efficiency |
|-----------|------------|
| -1.980000 | 0.501928   |
| -1.920000 | 0.431872   |
| -1.900000 | 0.382416   |
| -1.880000 | 0.571812   |
| -1.780000 | 0.534422   |


`df` has the following signature:

| speed | efficiency |
|-------|------------|

Let's take a look at `StatSmooth` output DataFrame:

In [4]:
df.statSmooth("speed", "efficiency").head(5)

| Stat.x    | Stat.y   | Stat.yMin | Stat.yMax | Stat.se  |
|-----------|----------|-----------|-----------|----------|
| -1.980000 | 0.253745 | 0.040217  | 0.467274  | 0.106917 |
| -1.940000 | 0.298804 | 0.088492  | 0.509116  | 0.105307 |
| -1.900000 | 0.344317 | 0.137204  | 0.551430  | 0.103705 |
| -1.860000 | 0.390396 | 0.186465  | 0.594327  | 0.102112 |
| -1.820000 | 0.437040 | 0.236271  | 0.637808  | 0.100528 |


It has the following signature:

<table>
  <thead>
    <tr>
      <th align="left" colspan="5">Stat</th>
    </tr>
  </thead>
  <thead>
    <tr>
      <th>x</th>
      <th>y</th>
      <th>yMin</th>
      <th>yMax</th>
      <th>se</th>
    </tr>
  </thead>
</table>

As you can see, we got a `DataFrame` with one `ColumnGroup` called `Stat`, which contains several columns with statistics.
For `statSmooth`, each row corresponds to one point of the smoothed line.
`Stat.x` is the `x` coordinate of this point.
`Stat.y` is its `y` coordinate.
`Stat.yMin` is the lower bound of the confidence interval.
`Stat.yMax` is the upper bound of the confidence interval.
`Stat.se` is the standard error.

A `DataFrame` with "smooth" statistics is called a `StatSmoothFrame`.

### `statSmooth` transform

`statSmooth(statSmoothArgs) { /*new plotting context*/ }` modifies the plotting context:
inside the new context, a new `StatSmooth` dataset, calculated from the given arguments,
is used instead of the original data (whether or not it was empty).
Inputs can be provided as an `Iterable` or as a reference to a dataset column
(by name as a `String`, as a `ColumnReference`, or as a `DataColumn`).
The original dataset and the primary context are not affected,
so you can still add layers that use the initial dataset outside the `statSmooth` context.
Since the old dataset is not available inside the new context, its columns cannot be referenced there.
But we can refer to the new ones.
They are all contained in the `Stat` group and can be used inside the new context:

In [5]:
plot {
    statSmooth(newXs, newYs) {
        // new `StatSmooth` dataset here
        area {
            // use `Stat.*` columns for mappings
            x(Stat.x)
            y(Stat.y)
        }
    }
    points {
        x(newXs)
        y(newYs)
    }
}

In [6]:
df.plot {
    statSmooth(speed, efficiency, method = SmoothMethod.Polynomial(2), smootherPointCount = 250) {
        ribbon {
            x(Stat.x)
            yMin(Stat.yMin)
            yMax(Stat.yMax)
        }
    }
    // Dataset is not changed here
    points {
        x(speed)
        y(efficiency)
    }
}

### `smoothLine` layer

The `smoothLine` layer is a shortcut for quickly plotting a smoothed line:

In [7]:
val smoothLineLayerPlot = plot {
    smoothLine(newXs, newYs)
    layout.title = "`smoothLine()` layer"
}
smoothLineLayerPlot

In [8]:
// Compare it with `statSmooth` + usual `line`
val statSmoothAndLinePlot = plot {
    statSmooth(newXs, newYs) {
        line {
            x(Stat.x)
            y(Stat.y)
        }
    }
    layout.title = "`statSmooth()` + non-statistical `line` layer"
}
plotGrid(listOf(smoothLineLayerPlot, statSmoothAndLinePlot))

`smoothLine` uses `statSmooth` and `line` and performs the coordinate mappings under the hood.
The `smoothLine` layer can also be customized: `smoothLine()` optionally opens a new context
where we can configure the line (as in the usual context opened by `line { ... }`)
and even change the default coordinate mappings.
The `StatSmooth` dataset of `smoothLine` can also be accessed there.

In [9]:
df.plot {
    smoothLine(speed, efficiency, SmoothMethod.LOESS(span = 0.1), 120) {
        // change the column mapped to `y` to `Stat.yMax`
        y(Stat.yMax)
        color = Color.RED
        width = 4.0
    }
    points {
        x(speed)
        y(efficiency)
    }
}

## Grouped `statSmooth`

`statSmooth` can be applied to grouped data:
statistics are calculated for each group independently.
This returns a new `GroupBy`
dataset with the same keys as the old one but with `StatSmooth` groups instead of the old ones.
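
The idea of calculating a statistic per group while keeping the keys can be sketched in plain Kotlin as follows. This is only an illustration, not how Kandy or DataFrame implement it: `Observation`, `FittedLine`, `fitLine`, and `fitPerGroup` are hypothetical names, and a simple least-squares line stands in for the full "smooth" statistic.

```kotlin
// Hypothetical row type with the same columns as the example dataset
data class Observation(val time: Double, val value: Double, val category: String)

// Hypothetical result of the per-group calculation
data class FittedLine(val slope: Double, val intercept: Double)

// Ordinary least-squares line through the points of a single group
fun fitLine(points: List<Observation>): FittedLine {
    require(points.size >= 2) { "need at least two points per group" }
    val meanX = points.map { it.time }.average()
    val meanY = points.map { it.value }.average()
    val sxx = points.sumOf { (it.time - meanX) * (it.time - meanX) }
    require(sxx > 0.0) { "time values in a group must not all be equal" }
    val sxy = points.sumOf { (it.time - meanX) * (it.value - meanY) }
    val slope = sxy / sxx
    return FittedLine(slope, meanY - slope * meanX)
}

// Each group is fitted on its own rows only; the grouping keys are preserved
fun fitPerGroup(observations: List<Observation>): Map<String, FittedLine> =
    observations.groupBy { it.category }.mapValues { (_, group) -> fitLine(group) }
```

The result has one entry per original key, which mirrors how the grouped `statSmooth` keeps the keys and replaces only the group contents.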

In [10]:
// Generate two lines
val fA = { x: Double -> 0.02 * x * x * x - 0.2 * x * x + 0.1 * x + 2.1 }
val fB = { x: Double -> -0.1 * x * x * x + 0.5 * x * x - 0.8 }
val xRange = (-500..500).map { it / 100.0 }
val noisesA = NormalDistribution(0.0, 0.05).sample(xRange.size).toList()
val noisesB = NormalDistribution(0.0, 0.2).sample(xRange.size).toList()
val valuesA = xRange.zip(noisesA).map { fA(it.first) + it.second }
val valuesB = xRange.zip(noisesB).map { fB(it.first) + it.second }

val (xsA, ysA) = xRange.zip(valuesA).shuffled(Random(17)).take(xRange.size * 1 / 3).sortedBy { it.first }
    .unzip()
val (xsB, ysB) = xRange.zip(valuesB).shuffled(Random(17)).take(xRange.size * 1 / 3).sortedBy { it.first }
    .unzip()

In [11]:
// Gather them into `DataFrame` with "A" and "B" keys in "category" column
val valuesDF = dataFrameOf(
    "time" to xsA + xsB,
    "value" to ysA + ysB,
    "category" to List(xsA.size) { "A" } + List(xsB.size) { "B" }
)
valuesDF.head(5)

| time      | value     | category |
|-----------|-----------|----------|
| -4.960000 | -5.737070 | A        |
| -4.890000 | -5.486676 | A        |
| -4.870000 | -5.426111 | A        |
| -4.840000 | -5.341519 | A        |
| -4.830000 | -5.366671 | A        |


It has the following signature:

| time | value | category |
|------|-------|----------|

In [12]:
// Group it by "category"
val groupedDF = valuesDF.groupBy { category }
groupedDF

| category | group               |
|----------|---------------------|
| A        | DataFrame [333 x 3] |
| B        | DataFrame [333 x 3] |

Group A (showing only top 5 of 333 rows):

| time      | value     | category |
|-----------|-----------|----------|
| -4.960000 | -5.737070 | A        |
| -4.890000 | -5.486676 | A        |
| -4.870000 | -5.426111 | A        |
| -4.840000 | -5.341519 | A        |
| -4.830000 | -5.366671 | A        |

Group B (showing only top 5 of 333 rows):

| time      | value     | category |
|-----------|-----------|----------|
| -4.960000 | 23.509611 | B        |
| -4.890000 | 22.852575 | B        |
| -4.870000 | 22.903565 | B        |
| -4.840000 | 21.983626 | B        |
| -4.830000 | 21.972353 | B        |


Now we have a `GroupBy` with a signature

<table>
  <thead>
    <tr>
      <th>key: [category]</th>
      <th>group: DataFrame[time|value|category]</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>A</td>
      <td>A-Group</td>
    </tr>
    <tr>
      <td>B</td>
      <td>B-Group</td>
    </tr>
  </tbody>
</table>

In [13]:
groupedDF.statSmooth { x(time); y(value) }

| category | group               |
|----------|---------------------|
| A        | DataFrame [100 x 1] |
| B        | DataFrame [100 x 1] |

Group A (showing only top 5 of 100 rows):

| Stat.x    | Stat.y    | Stat.yMin | Stat.yMax | Stat.se  |
|-----------|-----------|-----------|-----------|----------|
| -4.960000 | -4.228690 | -4.296966 | -4.160414 | 0.034708 |
| -4.859394 | -4.047957 | -4.115213 | -3.980701 | 0.034190 |
| -4.758788 | -3.867867 | -3.934108 | -3.801625 | 0.033674 |
| -4.658182 | -3.688466 | -3.753698 | -3.623233 | 0.033161 |
| -4.557576 | -3.509803 | -3.574031 | -3.445574 | 0.032650 |

Group B (showing only top 5 of 100 rows):

| Stat.x    | Stat.y    | Stat.yMin | Stat.yMax | Stat.se  |
|-----------|-----------|-----------|-----------|----------|
| -4.960000 | 17.517025 | 17.247831 | 17.786219 | 0.136844 |
| -4.859394 | 16.974446 | 16.709273 | 17.239619 | 0.134800 |
| -4.758788 | 16.433895 | 16.172722 | 16.695067 | 0.132767 |
| -4.658182 | 15.895519 | 15.638326 | 16.152713 | 0.130744 |
| -4.557576 | 15.359479 | 15.106243 | 15.612715 | 0.128732 |


After applying `statSmooth`, it is still a `GroupBy`, but the signature of `group` is different:
all groups have the same signature as a usual `DataFrame` after applying `statSmooth` (i.e., `StatSmoothFrame`):

<table>
  <thead>
    <tr>
      <th>key: [category]</th>
      <th>group: StatSmoothFrame</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>"A"</td>
      <td>"A"-Group</td>
    </tr>
    <tr>
      <td>"B"</td>
      <td>"B"-Group</td>
    </tr>
  </tbody>
</table>

As you can see, the `statSmooth` transformation was indeed applied within each group, and the grouping keys did not change.

The plotting process doesn't change much: we do everything the same way.

In [14]:
groupedDF.plot {
    statSmooth(time, value) {
        line {
            x(Stat.x)
            y(Stat.y)
        }
    }
}

As you can see, there are two lines because we have two groups of data.
To distinguish them, we need to map the key to the color.
This is convenient because the key is available in the context:

In [15]:
groupedDF.plot {
    statSmooth(time, value, method = SmoothMethod.Polynomial(3)) {
        line {
            x(Stat.x)
            y(Stat.y)
            color(key.category)
        }
    }
}

The `smoothLine()` layer also works.
Moreover, if we have exactly one grouping key, a mapping from it to `color` will be created by default.

In [16]:
groupedDF.plot {
    smoothLine(time, value)
}

We can customize it as before. The only difference is that the `key` columns are accessible:

In [17]:
groupedDF.plot {
    smoothLine(time, value) {
        color = Color.GREEN
        type(key.category)
    }
}

### Inside `groupBy{}` plot context

We can apply the `groupBy` modification to the initial dataset and build a smoothed line for the grouped data in the same way:

In [18]:
valuesDF.plot {
    groupBy(category) {
        smoothLine(time, value)
    }
}