[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/jolin-io/fall-in-love-with-julia/main?filepath=09%20StatsKit%20-%2001%20overview.ipynb)

<a href="https://www.jolin.io" target="_blank" rel="noreferrer noopener">
<img src="https://www.jolin.io/assets/Jolin/Jolin-Banner-Website-v1.1-darkmode.webp">
</a>

# Fall-in-love-with-Julia: Statistics & co with JuliaStats & StatsKit.jl

a 101 introduction session

In [None]:
using Random; Random.seed!(2022);  # make sure this tutorial is reproducible

from https://github.com/JuliaStats/StatsKit.jl

StatsKit.jl
========

  [![Build status](https://github.com/JuliaStats/StatsKit.jl/workflows/CI/badge.svg)](https://github.com/JuliaStats/StatsKit.jl/actions?query=workflow%3ACI+branch%3Amaster)

This is a convenience meta-package which allows loading essential packages for statistics in one command:
```julia
using StatsKit
```

Currently this loads the [Statistics](https://docs.julialang.org/en/stable/stdlib/Statistics/)
standard library module, and the following packages:

* [Bootstrap](https://github.com/juliangehring/Bootstrap.jl)
* [CategoricalArrays](https://github.com/JuliaData/CategoricalArrays.jl)
* [Clustering](https://github.com/JuliaStats/Clustering.jl)
* [CSV](https://github.com/JuliaData/CSV.jl)
* [DataFrames](https://github.com/JuliaData/DataFrames.jl)
* [Distances](https://github.com/JuliaStats/Distances.jl)
* [Distributions](https://github.com/JuliaStats/Distributions.jl)
* [GLM](https://github.com/JuliaStats/GLM.jl)
* [HypothesisTests](https://github.com/JuliaStats/HypothesisTests.jl)
* [KernelDensity](https://github.com/JuliaStats/KernelDensity.jl)
* [Loess](https://github.com/JuliaStats/Loess.jl)
* [MultivariateStats](https://github.com/JuliaStats/MultivariateStats.jl)
* [MixedModels](https://github.com/JuliaStats/MixedModels.jl)
* [StatsBase](https://github.com/JuliaStats/StatsBase.jl)
* [ShiftedArrays](https://github.com/JuliaArrays/ShiftedArrays.jl)
* [TimeSeries](https://github.com/JuliaStats/TimeSeries.jl)

This package is intended for users of statistics packages who want to get started with one import. Packages themselves should continue
to list individual packages in their dependencies rather than `StatsKit` as a whole.


# Helpful statistical data types

## missing

It is like `NaN`, but better, because it cannot accidentally result from floating-point computations.

In [None]:
missing == missing

In [None]:
missing + 1

In [None]:
2 * missing

In [None]:
using Statistics

In [None]:
mean([missing, 2, 10])

In [None]:
mean(skipmissing([missing, 2, 10]))

does the right thing almost magically

In [None]:
u = [missing, 2, 3]
v = skipmissing(u)
w = collect(v)
@show u v w;

In [None]:
findfirst(x -> x == 2, u)

In [None]:
findfirst(x -> x == 2, v)

In [None]:
findfirst(x -> x == 2, w)

Some extra utils can be found in [Missings.jl](https://github.com/JuliaData/Missings.jl)
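
As a short recap of the cells above, here is a sketch of how `missing` propagates, using only Base functions (`ismissing`, `isequal`, `coalesce`). The vector `xs` is made up for illustration.

```julia
# `missing` propagates through comparisons and arithmetic.
xs = [1, missing, 3]

@show missing == missing         # missing, not true
@show isequal(missing, missing)  # true: use isequal or ismissing to test for missing
@show ismissing.(xs)             # elementwise check

# Three-valued logic: the result is known whenever one operand already decides it.
@show true | missing             # true
@show false & missing            # false

# Replace missing values with a default.
@show coalesce.(xs, 0)
```

This is also why a predicate like `x -> x == 2` is risky on data with missing values: the comparison itself can return `missing` instead of a `Bool`.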

## CategoricalArrays.jl

In [None]:
using CategoricalArrays

x = CategoricalArray(["Old", "Young", "Middle", "Young"], ordered=true)

In [None]:
x[1]

In [None]:
unwrap(x[1])

In [None]:
x[1] = "Middle"

In [None]:
x[1]

In [None]:
levels(x)

## 👉 your time 

Change the value of `x[1]` to `"VeryOld"` or any other String not yet in the array.

What do you expect to happen? Take a look at `levels`.

In [None]:
# your space

# Statistics.jl (builtin)

https://docs.julialang.org/en/v1/stdlib/Statistics/

| function | description |
| -------- | ----------- |
| `var`    | variance    |
| `std`    | standard deviation |
| `mean`   | average     |
| `median` | 50% quantile |
| `middle` | mean of the extrema (minimum and maximum) |
| `cor`    | Pearson correlation |
| `cov`    | covariance  |
| `quantile` | quantiles |


In [None]:
mean(√, [1, 2, 3])

In [None]:
mean([√1, √2, √3])

In [None]:
mean(√,
     [1 2 3
      4 5 6],
     dims=2)

In [None]:
using Plots

In [None]:
randfactor = 1

x1 = collect(1:10) + randfactor*rand(10)
x2 = collect(1:10) + randfactor*rand(10)
plot(x1)
plot!(x2)

Covariance definition

$$
\text{cov}(X, Y) = \frac{1}{n - 1} \sum^n_{i=1}(x_i - \text{mean}(X))(y_i - \text{mean}(Y))
$$

Pearson correlation is a normalized covariance

$$
\text{cor}(X, Y) = \frac{\text{cov}(X, Y)}{\text{std}(X)\text{std}(Y)}
$$

In [None]:
@show cor(x1, x2)
@show cov(x1, x2)
@show cov(x1, x2) / (std(x1) * std(x2))
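
To connect the two formulas with the builtin functions, here is a minimal sketch that implements them by hand and compares against `cov` and `cor`. The helper names `manual_cov` and `manual_cor` and the data `a`, `b` are made up for illustration.

```julia
using Statistics

# Sample covariance with the 1/(n - 1) normalization from the definition.
manual_cov(x, y) = sum((x .- mean(x)) .* (y .- mean(y))) / (length(x) - 1)

# Pearson correlation: covariance rescaled by both standard deviations.
manual_cor(x, y) = manual_cov(x, y) / (std(x) * std(y))

a = [1.0, 2.0, 3.0, 4.0]   # illustrative data
b = [2.1, 3.9, 6.2, 7.8]

@show manual_cov(a, b) ≈ cov(a, b)
@show manual_cor(a, b) ≈ cor(a, b)
```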

In [None]:
cor([1,2,3], [21, 25, 29])

In [None]:
cor([1  1.0  10  8
     2  1.1   0  6
     3  0.9   5  4])

# StatsBase.jl

https://juliastats.org/StatsBase.jl/stable/

Key addition to builtin Statistics package:

- Many functions additionally support weights
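
To make "weights" concrete, here is a minimal sketch of a weighted mean written out by hand, without any package. With StatsBase loaded, `mean(obs_vals, weights(wts))` should compute the same quantity. The names `obs_vals` and `wts` and their values are made up for illustration.

```julia
obs_vals = [1.0, 2.0, 10.0]   # illustrative observations
wts      = [3.0, 1.0, 1.0]    # illustrative weights

# Weighted mean: each value counts in proportion to its weight.
weighted_mean = sum(wts .* obs_vals) / sum(wts)

# Unweighted mean for comparison (all weights equal).
plain_mean = sum(obs_vals) / length(obs_vals)

@show weighted_mean plain_mean
```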

| 1-D statistics | description |
| -------- | ----------- |
| `geomean`    | geometric mean    |
| `harmean`    | harmonic mean |
| `genmean`   | mean with generalized power |
| `skewness` | skewness |
| `kurtosis` | kurtosis |
| `moment`    | central moment of arbitrary order |
| `variation`    | ratio of the standard deviation to the mean  |
| `sem` | standard error of mean |
| `mad` | median absolute deviation <br> MAD is to median like variance is to mean |
| `zscore` | zscore |
| `percentile` | like `Statistics.quantile`, but with values from 0 to 100 |
| `iqr` | interquartile range (75% quantile - 25% quantile) |
| `nquantile` | splitting the whole range in `n` equal quantiles |
| `mode`, `modes` | most common number(s) | 

| multi-D statistics | description |
| -------- | ----------- |
| `partialcor` | partial correlation |
| `genvar` | generalized variance = determinant of covariance matrix |
| `totalvar` | total variance = sum of diagonal of covariance matrix |


From Wikipedia: partial correlation
![partial correlation drawing](https://upload.wikimedia.org/wikipedia/commons/thumb/9/9e/PartialCorrelationGeometrically.svg/512px-PartialCorrelationGeometrically.svg.png)


| counting | description |
| -------- | ----------- |
| `counts` | counts of integers |
| `proportions` | frequency of integers |
| `countmap` | counts arbitrary things |
| `proportionmap` | frequency of arbitrary things |
| `ecdf` | empirical cumulative distribution function |
| `Histogram` | histogram, see example below | 
| `levelsmap` | mapping unique values to normalized integer |
| `indexmap` | mapping unique values to first index |
| `indicatormat` | one-hot-encoding |


| probabilities | description |
| -------- | ----------- |
| `entropy` | entropy of probabilities |
| `renyientropy` |  Rényi (generalized) entropy |
| `crossentropy` | cross entropy of two probability vectors |
| `kldivergence` | Kullback-Leibler divergence |


| time series | description |
| -------- | ----------- |
| `autocor` | autocorrelation, normalized `autocov`, `crosscor` with itself |
| `autocov` | autocovariance, `crosscov` with itself |
| `crosscor` | cross correlation |
| `crosscov` | cross covariance |
| `pacf` | partial autocorrelation function <br> (autocorrelation between $z_{t}$ and $z_{t+k}$ <br> that is not accounted for by lags $1$ through $k − 1$)|


From Wikipedia:
> In signal processing, cross-correlation is a measure of similarity of two series as a function of the displacement of one relative to the other. This is also known as a sliding dot product or sliding inner-product. It is commonly used for searching a long signal for a shorter, known feature.
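
The "sliding dot product" from the quote can be sketched in a few lines of plain Julia. Note that this is an unnormalized, uncentered illustration and not how `crosscor` is implemented. The function name `sliding_dot` and the example data are made up.

```julia
# Slide `feature` along `signal` and record the dot product at each offset.
function sliding_dot(signal, feature)
    m = length(feature)
    return [sum(signal[i:i+m-1] .* feature) for i in 1:(length(signal) - m + 1)]
end

signal  = [0, 0, 1, 3, 1, 0, 0, 0]   # illustrative signal with a bump at positions 3 to 5
feature = [1, 3, 1]                  # the known shape we are searching for

scores = sliding_dot(signal, feature)
best_offset = argmax(scores)         # offset where the feature matches best
@show scores best_offset
```

The score peaks at the offset where the short feature lines up with the bump in the long signal.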



and more
- [Robust Statistics](https://juliastats.org/StatsBase.jl/stable/robust/) trim, winsor
- [Deviations between two arrays](https://juliastats.org/StatsBase.jl/stable/deviation/) L1dist, L2dist, Linfdist, ...
- [Ranking](https://juliastats.org/StatsBase.jl/stable/ranking/) ordinalrank, denserank, ...
- [Sampling](https://juliastats.org/StatsBase.jl/stable/sampling/) sample

In [None]:
using StatsBase

the builtin `Statistics` package is reused as much as possible

In [None]:
@which StatsBase.std

## summarystats

In [None]:
summarystats(randn(100))

## Histogram

In [None]:
# one observation in the small bin and three in the large
obs = [0.5, 1.5, 1.5, 2.5];

# a small and a large bin
bins = [0, 1, 7];

observe `isdensity = false` and the `weights` field records the number of observations in each bin

In [None]:
h = fit(Histogram, obs, bins)

observe `isdensity = true`, and `weights` now gives the number of observations per unit of bin width in each bin

In [None]:
using LinearAlgebra: normalize
normalize(h, mode=:density)
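
To see where the two sets of `weights` come from, here is a hand computation with plain Julia, assuming left-closed bins `[lo, hi)`. The names `obs_demo`, `edges`, `bin_counts`, `bin_widths` and `density` are made up for illustration.

```julia
obs_demo = [0.5, 1.5, 1.5, 2.5]   # same observations as above
edges    = [0, 1, 7]              # same bin edges as above

# Count observations per left-closed bin [lo, hi).
bin_counts = [count(x -> lo <= x < hi, obs_demo)
              for (lo, hi) in zip(edges[1:end-1], edges[2:end])]

bin_widths = diff(edges)
density    = bin_counts ./ bin_widths   # observations per unit of bin width

@show bin_counts bin_widths density
```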

## Standard Transform

In [None]:
standardize(ZScoreTransform,
            [0.0 -0.5 0.5
             0.0  1.0 2.0],
            dims=2)

In [None]:
standardize(UnitRangeTransform,
            [0.0 -0.5 0.5
             0.0  1.0 2.0],
            dims=2)
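
Both transforms can be reproduced by hand with the builtin `Statistics` functions. The sketch below assumes that `dims=2` means the statistics are computed along each row; the names `X`, `z` and `u` are made up for illustration.

```julia
using Statistics

X = [0.0 -0.5 0.5
     0.0  1.0 2.0]

# Z-score: subtract the row mean, divide by the row standard deviation.
z = (X .- mean(X, dims=2)) ./ std(X, dims=2)

# Unit range: shift by the row minimum, divide by the row range.
u = (X .- minimum(X, dims=2)) ./ (maximum(X, dims=2) .- minimum(X, dims=2))

display(z)
display(u)
```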

# Distributions.jl

In [None]:
using Distributions
using Plots

again the builtin `Statistics` package is reused as much as possible

In [None]:
@which Distributions.quantile

same holds for `StatsBase`

In [None]:
@which Distributions.kurtosis

let's use a Gaussian distribution, aka Normal distribution

In [None]:
d = Normal()

In [None]:
plot(x -> pdf(d, x))

In [None]:
fieldnames(typeof(d))

In [None]:
d.μ, d.σ

In [None]:
mean(d), std(d)

In [None]:
d_sample = rand(d, 1000)
histogram(d_sample)

In [None]:
mean(d_sample), std(d_sample)

In [None]:
d_truncated = truncated(Normal(4, 10), -10, 20)

In [None]:
plot(-20:+30, x -> pdf(d_truncated, x))

In [None]:
mean(d_truncated), std(d_truncated)

In [None]:
d_censored = censored(Normal(4, 10), -10, 20)

In [None]:
plot(-20:+30, x -> pdf(d_censored, x))

In [None]:
d_fitted = fit(Normal, d_sample)

What could be the reason that σ is different?
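
One likely reason, sketched here with the builtin `Statistics` only: `std` divides by `n - 1` by default, while a maximum likelihood fit of a Normal (which, to my knowledge, is what `fit(Normal, ...)` performs) divides by `n`. The sample `s` and the seed are made up for illustration.

```julia
using Statistics, Random

Random.seed!(1)                  # arbitrary seed for illustration
s = randn(1000)

sd_corrected   = std(s)                    # divides by n - 1 (default)
sd_uncorrected = std(s; corrected=false)   # divides by n

@show sd_corrected sd_uncorrected
# The two differ exactly by the factor sqrt((n - 1) / n).
@show sd_uncorrected ≈ sd_corrected * sqrt((length(s) - 1) / length(s))
```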

In [None]:
d_mixture = MixtureModel(
    [Normal(-4.0, 1.2)
     Normal(0.0, 1.0)
     Normal(3.0, 1.5)], [0.2, 0.5, 0.3])

In [None]:
plot(x -> pdf(d_mixture, x))

## 👉 your time 

Construct a distribution with two equally probable peaks. 

And get the 0.8 quantile

In [None]:
# your space

# Distances.jl

https://github.com/JuliaStats/Distances.jl

really important to know that this exists

In [None]:
using Distances

The warning means that Distances does not reuse the respective function in StatsBase. The reason is probably that Distances.jl wants to be the ground truth package for distances.

In [None]:
@which Distances.meanad

In [None]:
colwise(Euclidean(), rand(10, 3), rand(10, 3))

In [None]:
m1 = rand(Bool, 3, 4)
display(m1)

m2 = rand(Bool, 3, 2)
display(m2)

pairwise(Hamming(), m1, m2, dims=2)  # dims=2: compare columns of m1 with columns of m2

# HypothesisTests.jl

https://juliastats.org/HypothesisTests.jl/stable/

In [None]:
using HypothesisTests
x = rand(Normal(), 100)

In [None]:
pvalue(OneSampleTTest(x), tail=:both) # you can also set tail = :left or :right

In [None]:
mean_H0 = 0
pvalue(OneSampleTTest(x, mean_H0))

In [None]:
dist_H0 = Normal()
pvalue(OneSampleADTest(x, dist_H0))

alternatively you can get a confidence interval for the tested parameter (here the mean) by `confint`

In [None]:
confint(OneSampleTTest(x), level=0.95, tail=:both)  # you can also set tail = :left or :right

## 👉 your time 

Try to change the sample `x` such that the tests give a very low p-value 

In [None]:
# your space

## Do-it-yourself Hypothesis Tests

1. Come up with a test statistic to compute - it should differ between completely random data (H0) and your data.
2. Remember, a null hypothesis H0 is most useful if we can reject it!
3. Think about how your data is distributed if H0 is true
4. sample example data from your H0 distribution and compute your test statistic
5. repeat and look at how likely the observed test statistic is

Let's do something like a T-Test, checking whether two samples have different mean.

1. let's test the absolute difference between the means
2. we want to show "different means", so in order to reject our H0, our H0 should represent "same mean" (i.e. difference == 0)
3. we assume our data is Normal distributed, and under H0 both samples have the same distribution, so let's fit a Normal to all data
4. we draw two samples of the same size as the target, and compute our test statistic
5. e.g. we can use `StatsBase.ecdf`

In [None]:
n = 10 

data1 = rand(Normal(0), n)
data2 = rand(Normal(2), n)

In [None]:
my_test_statistic(data1, data2) = abs(mean(data1) - mean(data2))

In [None]:
my_test_statistic(data1, data2)

In [None]:
H0_dist = fit(Normal, [data1; data2])

H0_data1() = rand(H0_dist, n)
H0_data2() = rand(H0_dist, n)

In [None]:
test_statistic_H0_sample = [my_test_statistic(H0_data1(), H0_data2()) for _ in 1:1000000]

In [None]:
test_statistic_H0_cumdist = ecdf(test_statistic_H0_sample)

In [None]:
my_test_statistic(data1, data2)

In [None]:
alpha = 1 - test_statistic_H0_cumdist(my_test_statistic(data1, data2))  # be careful with tail: left, right, both?

alternatively, we can directly look how often we've found a higher value in our H0

In [None]:
mean(my_test_statistic(data1, data2) .< test_statistic_H0_sample)

What would it mean if this alpha estimator returns `0.0`? What would be our alpha?
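
As a hint, here is a minimal sketch using only the standard library. A simulated estimate of `0.0` only tells us that none of the `N` simulated values exceeded the observed one, so the true value is roughly below `1/N`. A common conservative estimator is `(k + 1) / (N + 1)`. The names `null_stats`, `observed`, `p_naive` and `p_upper` and all values are made up for illustration.

```julia
using Random

Random.seed!(1)                       # arbitrary seed for illustration
null_stats = abs.(randn(10_000))      # stand-in for simulated H0 test statistics
observed   = 10.0                     # hypothetical, far more extreme than any simulated value

n_exceed = count(>(observed), null_stats)
p_naive  = n_exceed / length(null_stats)              # 0.0 when nothing exceeds
p_upper  = (n_exceed + 1) / (length(null_stats) + 1)  # conservative correction

@show p_naive p_upper
```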

## 👉 your time 

Change `n` above, and see how our custom hypothesis test performs. Why?

Also change the `data1` and `data2` distributions above.

In [None]:
# your space

# Bootstrap.jl

https://github.com/juliangehring/Bootstrap.jl

Finally, I want to show you another application of sampling - Bootstrapping.

In bootstrapping we want to find the confidence interval of our estimator.

- Say we have a sample `x` and compute the mean `mean(x)`. How can we provide some confidence values for our `mean(x)`?

- If we only knew the underlying distribution that generated the sample, everything would be very simple, but it is usually unknown.

- As an estimator of the underlying distribution - the population - we just use our sample. We pretend our sample is the population and repeat the same process.

In [None]:
n = 1000
x = rand(Normal(), n)

In [None]:
mean(x)

Now we pretend our sample `x` is a good representation of the entire population. We want to sample from it, with same size as original, hence we need sampling with replacement.

In [None]:
sample(x, n, replace=true)

In [None]:
mean(sample(x, n, replace=true))

we repeat this in order to get a distribution of our estimator

In [None]:
mean_bootstrap_sample = [mean(sample(x, n, replace=true)) for _ in 1:1_000_000]

In [None]:
quantile(mean_bootstrap_sample, [0.2, 0.8])

In [None]:
mean(x)

There we have a minimal confidence interval for our statistical estimator! With the 0.2 and 0.8 quantiles it covers the central 60% of the bootstrap distribution.
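
The steps above can be wrapped into a small helper. This is a minimal sketch of a percentile bootstrap using only the standard library, not the Bootstrap.jl API. The function name `bootstrap_interval` and its keywords `B`, `level` and `rng` are made up for illustration.

```julia
using Statistics, Random

# Percentile bootstrap interval for an arbitrary estimator `f`.
function bootstrap_interval(f, data; B=2_000, level=0.95, rng=Random.default_rng())
    n = length(data)
    # rand(rng, data, n) draws n elements from `data` with replacement
    estimates = [f(rand(rng, data, n)) for _ in 1:B]
    tail = (1 - level) / 2
    return quantile(estimates, [tail, 1 - tail])
end

rng  = MersenneTwister(2022)   # arbitrary seed for illustration
data = randn(rng, 500)

lo, hi = bootstrap_interval(mean, data; rng=rng)
@show mean(data) lo hi
```

Passing another estimator, e.g. `var` or `median`, as the first argument gives the corresponding interval without further changes.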

## 👉 your time 

Try another estimator above, e.g. variance `var`.

In [None]:
# your space

More details and advanced methods on Statistical Bootstrap can be found at [Bootstrap.jl](https://github.com/juliangehring/Bootstrap.jl)

# Machine Learning related packages, bundled within StatsKit.jl

- [Clustering.jl](https://github.com/JuliaStats/Clustering.jl)
- [GLM.jl (generalized linear models)](https://github.com/JuliaStats/GLM.jl)
- [MixedModels.jl](https://github.com/JuliaStats/MixedModels.jl)
- [MultivariateStats.jl (dimension reduction)](https://github.com/JuliaStats/MultivariateStats.jl)

# Thank you for joining

for questions or suggestions please contact me at stephan.sahm@jolin.io

<a href="https://www.jolin.io" target="_blank" rel="noreferrer noopener">
<img src="https://www.jolin.io/assets/Jolin/Jolin-Banner-Website-v1.1-darkmode.webp">
</a>