[![Binder](https://mybinder.org/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/jolin-io/fall-in-love-with-julia/main?filepath=07%20streaming%20analytics%20-%2001%20OnlineStats.ipynb)

![](https://www.juliaperformance.com/assets/JuliaPerformance/banner-clouds.jpg)
<center><br></center>
<center><a href="https://www.juliaperformance.com">www.juliaperformance.com</a></center>
<center><em>making Julia the standard</em></center>

I am very happy to share with you that I started a new consultancy company with the sole focus on applying Julia technology.
<center>🍾🥂🎈🎉🔥</center>

# Introduction to streaming analytics in Julia with OnlineStats.jl

Welcome to this little Jupyter Notebook for getting to know real-time processing in Julia.

Disclaimer: All examples in this notebook are adapted from the OnlineStats documentation. The notebook gives an overall summary of OnlineStats, tested and adapted so that you get a great hands-on experience in this Binder Jupyter notebook.

[![OnlineStats](https://joshday.github.io/OnlineStats.jl/latest/assets/logo.svg)](https://joshday.github.io/OnlineStats.jl/latest/)

[OnlineStats.jl](https://joshday.github.io/OnlineStats.jl/latest/)

Online Algorithms for Statistics, Models, and Big Data Viz

- ⚡ High-performance single-pass algorithms for statistics and data viz.
- ➕ Updated one observation at a time.
- ✅ Algorithms use O(1) memory.
- 📈 Perfect for streaming and big data.

The perfect building blocks you need for your streaming analysis.

# Overview
1. [Basics I](#Basics-I)
2. [Basics II](#Basics-II)
3. [Visualizations](#Visualizations)
4. [Streaming and big data](#Streaming-and-big-data)
5. [Machine learning](#Machine-learning)

In [None]:
using Plots
using IJulia
using Random
using BenchmarkTools
using Base.Iterators
using Statistics
plotlyjs()

# Basics I

In [None]:
using OnlineStats

In [None]:
m = Mean()

In [None]:
supertypes(typeof(m))

Stats are subtypes of `OnlineStat{T}` where `T` is the type of a single observation.

## Fit

In [None]:
ys = randn(10)

In [None]:
for y in ys
    fit!(m, y)
    println(m)
end

In [None]:
value(m)

Stats can be updated with single or multiple observations e.g. `fit!(m, 1)` and `fit!(m, [1,2,3])`.

In [None]:
ys2 = randn(10) .+ 5

In [None]:
m2 = Mean()
fit!(m2, ys2)

In [None]:
m2.μ
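As a small check of the statement above, the following sketch fits the same data once observation by observation and once as a whole vector; both routes should end in the same state. It only uses `Mean`, `fit!` and `value` from the cells above. The data vector `ys_demo` and the variable names are made up for this illustration.

```julia
using OnlineStats

ys_demo = [1.0, 2.0, 3.0, 4.0]  # illustrative data, not from the notebook

# Route 1: update with one observation at a time
one_by_one = Mean()
for y in ys_demo
    fit!(one_by_one, y)
end

# Route 2: pass the whole collection in a single call
all_at_once = fit!(Mean(), ys_demo)

# Both routes should report the same mean (2.5 for this data)
value(one_by_one) ≈ value(all_at_once)
```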

## Merge

Stats can be merged.

In [None]:
merge!(m, m2)
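Why merging works can be sketched without OnlineStats: a mean together with its number of observations is enough to combine two partial results exactly. The helper `merge_means` below is hypothetical and only illustrates the idea; it is not claimed to be how OnlineStats implements `merge!`.

```julia
using Statistics

# Hypothetical helper: combine two partial means, each summarizing n1 / n2 observations.
function merge_means(μ1, n1, μ2, n2)
    n = n1 + n2
    μ = μ1 + (n2 / n) * (μ2 - μ1)  # weight each mean by its share of the observations
    return μ, n
end

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0]

μ_merged, n_merged = merge_means(mean(a), length(a), mean(b), length(b))

# Same result as computing the mean over all data at once
μ_merged ≈ mean(vcat(a, b))
```

Because only the two summaries are needed, the raw data of `a` and `b` never has to be in the same place.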

### 👈 Now it is your turn: Try out some other models!

E.g. `Variance`.

There is support for
- [Univariate Statistics](https://joshday.github.io/OnlineStats.jl/latest/stats_and_models/#Univariate-Statistics)
- [Time Series](https://joshday.github.io/OnlineStats.jl/latest/stats_and_models/#Time-Series)
- [Parametric Density Estimation](https://joshday.github.io/OnlineStats.jl/latest/stats_and_models/#Parametric-Density-Estimation)
- [Nonparametric Density Estimation](https://joshday.github.io/OnlineStats.jl/latest/stats_and_models/#Nonparametric-Density-Estimation)
- [Machine/Statistical Learning](https://joshday.github.io/OnlineStats.jl/latest/stats_and_models/#Machine/Statistical-Learning)
- [Other](https://joshday.github.io/OnlineStats.jl/latest/stats_and_models/#Other)

An overview of available models can be found [here in the OnlineStats.jl documentation](https://joshday.github.io/OnlineStats.jl/latest/stats_and_models/), or [here in the OnlineStats.jl API overview](https://joshday.github.io/OnlineStats.jl/latest/api/).

In [None]:
# your space

-----------

**Example:** Efficiently counting unique elements (approx).

In [None]:
o = HyperLogLog()
fit!(o, rand(1:100, 10^6))
o
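For comparison, an exact count has to remember every distinct value it has seen, so its memory grows with the number of unique elements, whereas HyperLogLog trades a small error for a fixed-size summary. A minimal exact baseline in Base Julia could look like this (the variable names are illustrative):

```julia
using Random

Random.seed!(1)
data = rand(1:100, 10^6)

# Exact approach: a Set stores each distinct value once
exact_unique = length(Set(data))

# There can never be more than 100 distinct values in this data
exact_unique <= 100
```

With only 100 possible values the `Set` stays tiny; the approximate approach pays off when the number of distinct values is huge.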

**Example:** Nonparametric Density Estimation

In [None]:
o = fit!(Hist(-5:.1:5), randn(10^6))

# approximate statistics
using Statistics
@show mean(o)
@show var(o)
@show std(o)
@show quantile(o)
@show median(o)
@show extrema(o)
o
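The statistics above are approximate because a histogram only keeps bin counts, not the data itself. The following sketch, in Base Julia and with illustrative names, shows the idea for the mean: every observation is replaced by the midpoint of its bin. This is an assumption-level illustration, not necessarily how `Hist` computes its statistics.

```julia
using Random, Statistics

Random.seed!(42)
data = randn(10^5)

edges = -5:0.1:5                        # same bin edges as in the cell above
counts = zeros(Int, length(edges) - 1)  # one counter per bin

for x in data
    # skip the (very rare) values outside the histogram range
    first(edges) <= x < last(edges) || continue
    # index of the last edge that is <= x, i.e. the bin that contains x
    counts[searchsortedlast(edges, x)] += 1
end

# Approximate mean: treat every observation as if it sat at its bin midpoint
midpoints = (edges[1:end-1] .+ edges[2:end]) ./ 2
approx_mean = sum(midpoints .* counts) / sum(counts)

# The error is bounded by roughly half a bin width
abs(approx_mean - mean(data)) < 0.06
```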

# Basics II

## Stack

Stats can be combined.

![drawing](https://user-images.githubusercontent.com/8075494/57342826-bf088c00-710e-11e9-9ac0-f3c1e5aa7a7d.png)

`Series`: several stats fitted to the same data stream.

In [None]:
y = rand(1000)
s = Series(Mean(), Variance())
fit!(s, y)

`Group`: one stat per element of each multivariate observation.

In [None]:
g = Group(Mean(), CountMap(Bool))
itr = zip(randn(100), rand(Bool, 100))
fit!(g, itr)

`GroupBy`: one stat per group key.

In [None]:
x = rand(Bool, 10^5)
y = x .+ randn(10^5)
fit!(GroupBy(Bool, Series(Mean(), Extrema())), zip(x,y))
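The grouping idea can be sketched in Base Julia: keep one small running state per key and update only the state that belongs to the incoming observation. The function `grouped_means` and the demo data are hypothetical and serve only as an illustration of the concept.

```julia
# Hypothetical helper: running mean per group key, one pass over (key, value) pairs
function grouped_means(pairs)
    state = Dict{Bool, Tuple{Float64, Int}}()  # key => (running sum, count)
    for (k, v) in pairs
        s, n = get(state, k, (0.0, 0))
        state[k] = (s + v, n + 1)
    end
    return Dict(k => s / n for (k, (s, n)) in state)
end

keys_demo = [true, false, true, false, true]
vals_demo = [1.0, 10.0, 2.0, 20.0, 3.0]

group_means = grouped_means(zip(keys_demo, vals_demo))
# group_means[true] is 2.0 and group_means[false] is 15.0
```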

## FilterTransform

Combine your aggregations with filters and transformations.

In [None]:
using OnlineStats: SkipMissing

In [None]:
s = SkipMissing(Series(Mean(), Variance()))
fit!(s, [-1, missing, 2, 1, 9])

The generic case is handled with `FilterTransform`:

In [None]:
using OnlineStats: FilterTransform

In [None]:
T = Union{Missing,Number}
s = FilterTransform(Series(Mean(), Variance()), T, filter = !ismissing, transform = abs)
fit!(s, [-1, missing, 2, 1, 9])
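For intuition, the same filter-then-transform step can be written as a lazy generator in Base Julia. This is only a sketch of the concept with an illustrative variable name; it computes the plain mean and does not involve OnlineStats.

```julia
using Statistics

data = [-1, missing, 2, 1, 9]

# Lazily drop missing values (filter) and take absolute values (transform)
cleaned = (abs(x) for x in data if !ismissing(x))

mean(cleaned)  # (1 + 2 + 1 + 9) / 4 = 3.25
```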

An alternative way of writing filter and transform expressions uses `=>`:

In [None]:
o = FilterTransform(String => (x -> startswith(x, "-")) => (x -> parse(Int,x)) => Series(Mean(), Variance()))
fit!(o, convert.(String, split("1,2,3,-1,4,-5,1,2,-3,-1,2,3", ",")))

For building failure-resistant pipelines, remember that there is `TryCatch`:

In [None]:
using OnlineStats: TryCatch

In [None]:
o = TryCatch(Mean())
fit!(o, [1, missing, 3])
o.errors

In [None]:
o.stat

## Weights

Control how to react to changing data.

![weight example gif](https://user-images.githubusercontent.com/8075494/57347308-d4d27d00-711f-11e9-8fbe-fc4523b96b48.gif)

In [None]:
y = randn(100);

In [None]:
fit!(Mean(weight = ExponentialWeight(0.2)), y)

In [None]:
fit!(Mean(weight = x -> 0.2), y)

In [None]:
fit!(Mean(weight = EqualWeight()), y)
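How a weight influences a running mean can be sketched in Base Julia. The assumption here is the common update rule `μ = (1 - γ) * μ + γ * x`, where `γ` is the weight returned for the `n`-th observation; `weighted_mean` is a hypothetical helper and not part of OnlineStats.

```julia
# Hypothetical helper: running mean where `weight(n)` decides how much
# the n-th observation counts relative to everything seen before.
function weighted_mean(xs, weight)
    μ = 0.0
    for (n, x) in enumerate(xs)
        γ = weight(n)
        μ = (1 - γ) * μ + γ * x
    end
    return μ
end

xs = [0.0, 0.0, 0.0, 10.0]  # the stream changes at the last observation

fast = weighted_mean(xs, n -> 0.5)  # large weight: reacts quickly, gives 5.0
slow = weighted_mean(xs, n -> 0.1)  # small weight: reacts slowly, gives 1.0
```

A larger constant weight forgets old data faster, which is the behaviour shown in the animation above.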

## 👈 your turn: Which weight function produces equal weights?

In [None]:
# your space

# fit!(Mean(weight = x -> ...? ), y)

[weight documentation](https://joshday.github.io/OnlineStats.jl/latest/weights/) (try a bit harder before you look up the solution)

---------

Awesome. Now you are already capable of writing complex stream analysis pipelines!

🎈Congratulations🎉

# Visualizations

Summarize your data beautifully with OnlineStats.jl.

## Trace

Record how your OnlineStat got fitted.

In [None]:
y = range(1, 20, length=10^6) .* randn(10^6)
o = Trace(Extrema())
fit!(o, y)
plot(o)

### 👈 your turn: plot a trace with three different OnlineStats at once

In [None]:
# your space

<center title="use Series(Extrema(), ...)"><em>hover for a hint</em></center>

## Histograms

Summarize the distribution of your data.

In [None]:
s = fit!(Series(KHist(25), Hist(-5:.2:5), ExpandingHist(100), Ash(ExpandingHist(1000))), randn(10^6))
plot(s, link = :x, label = ["KHist" "Hist" "ExpandingHist" "Ash"])

In [None]:
?ExpandingHist

## Summarizing 2D

#### Partition: 2D Nobs * continuous|categorical

In [None]:
y = cumsum(randn(10^6)) + 100randn(10^6)
o = Partition(KHist(10), 50)
fit!(o, y)
plot(o)

In [None]:
o = Partition(Series(Mean(), Extrema()), 50)
fit!(o, y)
plot(o)

This works with categorical data too:

In [None]:
y = [
    rand(["a", "a", "b", "c"], 10^3)
    rand(["a", "b", "b", "d"], 10^3)
    rand(["c", "b", "d"], 10^3)
]
o = Partition(CountMap(String), 75)
fit!(o, y)
plot(o)

#### IndexedPartition: 2D continuous * continuous|categorical

The `Partition` type can only track the number of observations in the x-axis. If you wish to plot one variable against another, you can use an `IndexedPartition`.

There is [OnlineStats.IndexedPartition](https://joshday.github.io/OnlineStats.jl/latest/dataviz/#Indexed-Partitions) and [OnlineStats.KIndexedPartition](https://joshday.github.io/OnlineStats.jl/latest/dataviz/#K-Indexed-Partitions). Both are very similar, see the documentation for details.

In [None]:
x = randn(10^6)
y = x + randn(10^6)

o = fit!(IndexedPartition(Float64, KHist(10), 50), zip(x, y))
plot(o)

In [None]:
x = [rand(10^3);      rand(10^3) .* 2]
y = [rand(1:5, 10^3); rand([1, 1, 1, 2, 2], 10^3)]
o = fit!(IndexedPartition(Float64, CountMap(Int), 50), zip(x,y))
plot(o, xlab = "X", ylab = "Y")

#### Mosaic: 2D categorical x categorical

In [None]:
using RDatasets
t = dataset("ggplot2", "diamonds")
o = Mosaic(eltype(t.Cut), eltype(t.Color))

fit!(o, zip(t.Cut, t.Color))
plot(o, legendtitle="Color", xlabel="Cut")

#### HeatMap: 2D continuous * continuous

In [None]:
# activating standard Plots backend
# for some reason the HeatMap does not work with plotlyjs()
gr()

nframes = 100
o = HeatMap(-5:.2:5, 0:.2:10)  # xedges, yedges

@gif for i in 1:nframes # for more animations see https://docs.juliaplots.org/latest/animations/ 
    x = randn(5i)
    y = randexp(5i)
    fit!(o, zip(x,y))
    plot(o)
end

In [None]:
plotlyjs()  # again back to plotlyjs

# Streaming and big data

Streaming analytics can be used wherever normal analytics applies, but its unique advantage becomes crystal clear when the data is too big to fit into memory.

In this example, we'll calculate some statistics from a 55-million-row CSV file provided by Kaggle.

| Field | Description |
| ----- | :---------- |
| **key** | Unique string identifying each row in both the training and test sets. Comprised of pickup_datetime plus a unique integer, but this doesn't matter, it should just be used as a unique ID field. |
| **pickup_datetime** | timestamp value indicating when the taxi ride started. |
| **pickup_longitude** | float for longitude coordinate of where the taxi ride started. |
| **pickup_latitude** | float for latitude coordinate of where the taxi ride started. |
| **dropoff_longitude** | float for longitude coordinate of where the taxi ride ended. |
| **dropoff_latitude** | float for latitude coordinate of where the taxi ride ended. |
| **passenger_count** | integer indicating the number of passengers in the taxi ride. |
| **fare_amount** | float dollar amount of the cost of the taxi ride. This value is only in the training set; this is what you are predicting in the test set. |

[kaggle data description](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data)

In [None]:
run(`kaggle competitions download -c new-york-city-taxi-fare-prediction -p data/`)

In [None]:
run(`unzip data/new-york-city-taxi-fare-prediction.zip -d data/`)  # takes about 1 minute

In [None]:
; ls -l -h data

This CSV is already too big to be loaded into memory on most machines. Still, we can work with it using OnlineStats.jl without any problem.

In [None]:
using CSV

[CSV documentation](https://csv.juliadata.org/stable/)
> CSV.Rows: an alternative approach for consuming delimited data, where the input is only consumed one row at a time, which allows "streaming" the data with a lower memory footprint than CSV.File.

This enables us to read CSV files that are too big for memory with ease:

In [None]:
@btime sum(1 for row in CSV.Rows("data/train.csv", reusebuffer=true, limit=1000000))

In [None]:
@btime begin
    c = Counter(CSV.Row2)
    rows = CSV.Rows("data/train.csv", reusebuffer=true, limit=1000000)
    fit!(c, rows)
end

In [None]:
@btime begin
    c = Counter(CSV.Row)
    rows = CSV.File("data/train.csv", limit=1000000)
    fit!(c, rows)
end

In [None]:
@btime begin
    m = Mean()
    rows = CSV.Rows("data/train.csv", reusebuffer=true, limit=1000000)
    itr = (parse(Int64, r.passenger_count) for r in rows)
    fit!(m, itr)
end

In [None]:
@btime begin
    m = Mean()
    rows = CSV.File("data/train.csv", limit=1000000)
    itr = (r.passenger_count for r in rows)
    fit!(m, itr)
end
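The streaming idea behind these benchmarks can also be sketched without any packages: read one line at a time and keep only a running total and a count. The tiny temporary file, its two columns and the helper `stream_mean` are made up for this illustration; the sketch does no quoting or error handling the way CSV.jl does.

```julia
# Write a tiny stand-in file (hypothetical data, not the Kaggle file)
path, io = mktemp()
write(io, "fare_amount,passenger_count\n4.5,1\n16.9,2\n5.7,1\n")
close(io)

# Hypothetical helper: mean of one numeric column, one line in memory at a time
function stream_mean(path, column)
    total, n = 0.0, 0
    for (i, line) in enumerate(eachline(path))
        i == 1 && continue  # skip the header line
        total += parse(Float64, split(line, ',')[column])
        n += 1
    end
    return total / n
end

mean_fare = stream_mean(path, 1)        # (4.5 + 16.9 + 5.7) / 3
mean_passengers = stream_mean(path, 2)  # (1 + 2 + 1) / 3
```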

## 👈 your turn: Run another OnlineStat over the file

You can be creative. You don't need `@btime` here; it runs the code several times and therefore slows things down.

In [None]:
# your space

-------------

Let's finish with a slightly more complex statistic.

In [None]:
rows = CSV.Rows("data/train.csv", reusebuffer=true)
itr = (convert(String, row.passenger_count) => parse(Float64, row.fare_amount) for row in rows)
collect(take(itr, 10))

In [None]:
o = GroupBy(String, Hist(0.0:100))
fit!(o, take(itr, 1000000))

sort!(o)
plots = [plot(o.value[key], title="#passenger = $key") for key in keys(o.value)]
plot(plots..., link = :all, legend = false, yticks = false, titlefont = font(10), plot_title = "Distribution of taxi price")

## Distributed

In [None]:
using Distributed
addprocs(2)
@everywhere using OnlineStats
nprocs()

![comparing sequential vs parallel](https://user-images.githubusercontent.com/8075494/57345083-95079780-7117-11e9-81bf-71b0469f04c7.png)

Simplified (Not Actually in Parallel)

In [None]:
y1 = randn(10_000)
y2 = randn(10_000)
y3 = randn(10_000)

a = Series(Mean(), Variance(), KHist(20))
b = Series(Mean(), Variance(), KHist(20))
c = Series(Mean(), Variance(), KHist(20))

fit!(a, y1)
fit!(b, y2)
fit!(c, y3)

merge!(a, b)  # merge `b` into `a`
merge!(a, c)  # merge `c` into `a`
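The pattern above (summarize each chunk on its own, then merge the summaries) is what makes parallel execution possible, because the chunks never need to see each other's data. A minimal Base Julia sketch of the same pattern with a hand-written summary; `summarize` and `combine` are hypothetical names used only here.

```julia
chunks = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]

# Summarize one chunk independently of all others
summarize(chunk) = (total = sum(chunk), n = length(chunk))

# Merge two summaries; the order of merging does not change the result
combine(a, b) = (total = a.total + b.total, n = a.n + b.n)

overall = mapreduce(summarize, combine, chunks)
overall_mean = overall.total / overall.n  # 21 / 6 = 3.5
```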

In Parallel

In [None]:
s = @btime @distributed merge for i in 1:3
    o = Series(Mean(), Variance(), KHist(20))
    fit!(o, randn(10_000))
end

## 👈 your turn: @btime a sequential version

Little extra: write it so that the loop range can easily be increased from `1:3` to `1:15`, in order to compare different data sizes.

In [None]:
# your space

**Note about the benchmark:** On Binder there is not much of a difference. On my local computer, I get a 2x speedup.

# Machine learning

## Confidence Interval Estimation - Bootstrapping

Bootstrapping is a nonparametric method to estimate confidence intervals on arbitrary statistics. See [bootstrapping on Wikipedia](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) for more details.

In [None]:
o = Bootstrap(Variance())
fit!(o, randn(1000))
confint(o, .95)
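For intuition, here is the classical batch bootstrap in Base Julia: resample the data with replacement many times, recompute the statistic, and take percentiles of the results. Note that this is the textbook offline variant, swapped in for illustration; it needs all data in memory and is not how the online `Bootstrap` works internally. The helper `bootstrap_ci` and its keyword names are hypothetical.

```julia
using Random, Statistics

# Hypothetical helper: percentile bootstrap confidence interval for `stat`
function bootstrap_ci(stat, data; nboot = 1000, level = 0.95, rng = Random.default_rng())
    n = length(data)
    # each replicate: draw n indices with replacement and recompute the statistic
    estimates = [stat(data[rand(rng, 1:n, n)]) for _ in 1:nboot]
    α = (1 - level) / 2
    return quantile(estimates, [α, 1 - α])
end

rng = MersenneTwister(1)
data = randn(rng, 1000)

lower, upper = bootstrap_ci(var, data; rng = rng)
```

Any function of a data vector can be passed as `stat`, for example `median` or `std`.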

### 👈 your turn: Get the confidence interval for another statistic

In [None]:
# your space

## TimeSeries

In [None]:
y = cumsum(randn(100))
o = AutoCov(5)
fit!(o, y)
autocov(o)
autocor(o)

## Unsupervised Learning

#### Covariance

In [None]:
o = fit!(CovMatrix(), randn(100, 4) |> eachrow)
plot(o, yflip=true)

#### KMeans

In [None]:
x = [randn() + 5i for i in rand(Bool, 10^6), j in 1:2]
o = fit!(KMeans(2), eachrow(x)) 
sort!(o; rev=true)  # Order clusters by number of observations

#### Online PCA

In [None]:
# Project 10-dimensional vectors into 2D
o = CCIPCA(2, 10)

In [None]:
u1 = rand(10)
fit!(o, u1)
u2 = rand(10)
fit!(o, u2)

In [None]:
# Project u3 into PCA space fitted to u1 and u2 but don't change the projection
u3 = rand(10)
OnlineStats.transform(o, u3)

In [None]:
# Fit u4 and then project u4 into the space
u4 = rand(10)
OnlineStats.fittransform!(o, u4)

In [None]:
# Sort from high to low eigenvalues
sort!(o)

In [None]:
# Get primary (1st) eigenvector
o[1]

In [None]:
# Get the variation (explained) "by" each eigenvector
OnlineStats.relativevariances(o)

## Supervised Learning

#### Linear regression

In [None]:
x = randn(100, 5)
y = x * (1:5) + randn(100)
o = fit!(LinReg(), zip(eachrow(x),y))
coef(o)

In [None]:
@show test = randn(5)
predict(o, test)
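One way to see why linear regression can be fitted on a stream: the least-squares solution only depends on the sums `X'X` and `X'y`, and both can be accumulated row by row. The sketch below uses only the standard library; `online_linreg` is a hypothetical helper, and this is not necessarily how `LinReg` is implemented.

```julia
using LinearAlgebra

# Hypothetical helper: accumulate X'X and X'y one observation at a time
function online_linreg(rows, ys, p)
    A = zeros(p, p)  # running X'X
    b = zeros(p)     # running X'y
    for (x, y) in zip(rows, ys)
        A .+= x * x'
        b .+= x .* y
    end
    return A \ b     # solve the normal equations once at the end
end

x_demo = [1.0 0.0; 0.0 1.0; 1.0 1.0; 2.0 1.0]
β_true = [2.0, -1.0]
y_demo = x_demo * β_true  # noise-free, so the fit should recover β_true

coefs = online_linreg(eachrow(x_demo), y_demo, 2)
```

The memory needed is a `p × p` matrix and a length-`p` vector, independent of the number of rows.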

#### Trees and Forests

In [None]:
x = randn(10^5, 10)
y = rand([1,2], 10^5)
o = fit!(FastTree(10), zip(eachrow(x),y))
xi = randn(10)
classify(o, xi)

In [None]:
x, y = randn(10^5, 10), rand(1:2, 10^5)
o = fit!(FastForest(10), zip(eachrow(x),y))
classify(o, x[1,:])

#### Naive Bayes Classifier

In [None]:
# make data
x = randn(10^6, 5)
y = x * [1,3,5,7,9] .> 0

o = NBClassifier(5, Bool)  # 5 predictors with Boolean categories
fit!(o, zip(eachrow(x), y))
plot(o, titlefont=font(10), label=["false" "true"])

#### 👈 your turn: predict and classify with our NBClassifier

In [None]:
# your space

#### StatLearn

A flexible mini-framework for online fitting of machine learning models via stochastic approximation. Many different models can be represented, e.g. LASSO regression.

Please consult the [documentation](https://joshday.github.io/OnlineStats.jl/latest/ml/) for more information and examples.

# Thank you for joining today!

As always, you are welcome to reach out if you have questions.

stephan.sahm@juliaperformance.com, now CEO of JuliaPerformance
