# Numerical and statistical tools

- In this notebook we cover packages that didn't have a home in one of the other sections
- These include packages for computing derivatives, basic statistics, handling data and more

## Distributions.jl

- In my opinion Distributions.jl is one of the best examples of flexible, performant, and idiomatic Julia code
- Provides routines for working with probability distributions, including:
    - Computing moments/statistics: mean, median, mode, entropy, mgf, quantile
    - Probability evaluation: pdf, cdf, ccdf, quantile, invlogcdf
    - Sampling: rand and rand!
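
As a quick illustration of the three groups above, here is a minimal sketch using a standard Normal. It assumes a recent Distributions.jl release on Julia 0.7 or later, where `rand!` lives in the `Random` standard library; the comments give the closed-form values for N(0, 1) rather than captured output.

```julia
using Distributions
using Random  # provides rand! on Julia 0.7 and later

d = Normal(0.0, 1.0)

# Moments / statistics
@show mean(d) median(d) mode(d)
@show entropy(d)              # 0.5 * log(2πe) for a standard Normal
@show mgf(d, 0.5)             # exp(0.5^2 / 2)

# Probability evaluation
@show pdf(d, 0.0)             # 1 / sqrt(2π)
@show cdf(d, 0.0) ccdf(d, 0.0)
@show quantile(d, 0.975)
@show invlogcdf(d, log(0.5))  # inverse of the log-cdf, here the median

# Sampling: rand allocates a new array, rand! fills an existing one
x = rand(d, 3)
buf = zeros(5)
rand!(d, buf)
```

The same calls work for most of the other distributions in the package, which is what makes the generic loops later in this section possible.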

In [1]:
# Pkg.add("Distributions")

### Distributions.jl Basics

In [4]:
using Distributions

In [5]:
# all subtypes of `Distributions.Distribution`
length(subtypes(Distribution))

67

In [6]:
?Normal  # good documentation

search: Normal normalize normalize! NormalCanon normalize_string



```
Normal(μ,σ)
```

The *Normal distribution* with mean `μ` and standard deviation `σ` has probability density function

$$
f(x; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}}
\exp \left( - \frac{(x - \mu)^2}{2 \sigma^2} \right)
$$

```julia
Normal()          # standard Normal distribution with zero mean and unit variance
Normal(mu)        # Normal distribution with mean mu and unit variance
Normal(mu, sig)   # Normal distribution with mean mu and variance sig^2

params(d)         # Get the parameters, i.e. (mu, sig)
mean(d)           # Get the mean, i.e. mu
std(d)            # Get the standard deviation, i.e. sig
```

External links

  * [Normal distribution on Wikipedia](http://en.wikipedia.org/wiki/Normal_distribution)
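
To connect the density formula in the docstring to the package, the sketch below writes the formula out by hand and compares it with `pdf`. The helper `normal_pdf` and the parameter values are my own illustration, not part of Distributions.jl.

```julia
using Distributions

# Hypothetical helper: the N(μ, σ) density written out from the formula above
normal_pdf(x, μ, σ) = exp(-(x - μ)^2 / (2σ^2)) / sqrt(2π * σ^2)

μ, σ = 1.5, 2.0
d = Normal(μ, σ)

# The hand-written density should agree with the package at every point
for x in (-1.0, 0.0, 1.5, 4.0)
    @assert isapprox(pdf(d, x), normal_pdf(x, μ, σ))
end

@show params(d) mean(d) std(d)
```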


In [7]:
dists = [
    Normal(0, 1),
    Beta(1.0, 2.0),
    Chisq(5),
    Frechet(5.0, 2.0),
    Gamma(1.0, 2.0),
    Pareto(3.0, 2.0),
    Binomial(10, 0.6),
    Poisson(0.7),
    MvLogNormal(ones(2), 3*eye(2)),
    Dirichlet([0.1, 0.2, 0.3, 0.4]),
    InverseWishart(5, eye(2)),
    MixtureModel(Normal[
        Normal(-2.0, 1.2),
        Normal(0.0, 1.0),
        Normal(3.0, 2.5)], 
        [0.2, 0.5, 0.3]  # prior
    )
]

for d in dists
    println("Working with distribution: $(repr(d))")
    @show mean(d)
    if isa(d, Distributions.UnivariateDistribution)
        @show rand(d, 2, 2)
    else
        @show rand(d, 2)
    end
    
    @show pdf(d, rand(d))
    println("\n\n\n")
end

Working with distribution: Distributions.Normal{Float64}(μ=0.0, σ=1.0)
mean(d) = 0.0
rand(d,2,2) = [-0.445871 -0.681068; -1.25226 0.941212]
pdf(d,rand(d)) = 0.2737153249294948




Working with distribution: Distributions.Beta{Float64}(α=1.0, β=2.0)
mean(d) = 0.3333333333333333
rand(d,2,2) = [0.00249029 0.673038; 0.26132 0.0713112]
pdf(d,rand(d)) = 1.7747200660175333




Working with distribution: Distributions.Chisq{Float64}(ν=5.0)
mean(d) = 5.0
rand(d,2,2) = [3.00729 1.69783; 4.74108 6.10545]
pdf(d,rand(d)) = 0.15179813984353507




Working with distribution: Distributions.Frechet{Float64}(α=5.0, θ=2.0)
mean(d) = 2.3284594274506065
rand(d,2,2) = [2.77902 1.62849; 2.41353 2.66095]
pdf(d,rand(d)) = 0.9113177551185412




Working with distribution: Distributions.Gamma{Float64}(α=1.0, θ=2.0)
mean(d) = 2.0
rand(d,2,2) = [2.90441 0.265366; 3.58685 0.968195]
pdf(d,rand(d)) = 0.3730812977457815




Working with distribution: Distributions.Pareto{Float64}(α=3.0, θ=2.0)
mean(d) = 3.0
rand(d,2,2

### More than you need


Let's list all the available distributions, by type of distribution

In [19]:
dist_types = [
    Distributions.DiscreteMatrixDistribution,
    Distributions.DiscreteMultivariateDistribution,
    Distributions.DiscreteUnivariateDistribution,
    Distributions.ContinuousMatrixDistribution,
    Distributions.ContinuousMultivariateDistribution,
    Distributions.ContinuousUnivariateDistribution,   
]

for T in dist_types
    println("$T: ")
    @show subtypes(T)
    println("\n\n")
end 

Distributions.Distribution{Distributions.Matrixvariate,Distributions.Discrete}: 
subtypes(T) = Any[Distributions.AbstractMixtureModel{Distributions.Matrixvariate,Distributions.Discrete,C<:Distributions.Distribution}]



Distributions.Distribution{Distributions.Multivariate,Distributions.Discrete}: 
subtypes(T) = Any[Distributions.AbstractMixtureModel{Distributions.Multivariate,Distributions.Discrete,C<:Distributions.Distribution},Distributions.DirichletMultinomial{T<:Real},Distributions.Multinomial{T<:Real}]



Distributions.Distribution{Distributions.Univariate,Distributions.Discrete}: 
subtypes(T) = Any[Distributions.AbstractMixtureModel{Distributions.Univariate,Distributions.Discrete,C<:Distributions.Distribution},Distributions.Bernoulli{T<:Real},Distributions.BetaBinomial{T<:Real},Distributions.Binomial{T<:Real},Distributions.Categorical{T<:Real},Distributions.DiscreteUniform,Distributions.Geometric{T<:Real},Distributions.Hypergeometric,Distributions.NegativeBinomial{T<:Real},Distr
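
The six abstract types listed above are also what you dispatch on in your own code. The sketch below checks where a few familiar distributions sit in the hierarchy and replaces the `isa` branch from the earlier loop with two methods. The function name `draw` is hypothetical, and I am assuming a recent Distributions.jl where these type aliases are exported.

```julia
using Distributions

# Each concrete distribution sits under one of the six abstract types
@show Normal <: ContinuousUnivariateDistribution
@show Poisson <: DiscreteUnivariateDistribution
@show Dirichlet <: ContinuousMultivariateDistribution
@show Multinomial <: DiscreteMultivariateDistribution
@show InverseWishart <: ContinuousMatrixDistribution

# Hypothetical helper: the more specific method is chosen for univariate
# distributions, so no explicit isa check is needed
draw(d::UnivariateDistribution) = rand(d, 2, 2)  # 2×2 matrix of scalar draws
draw(d::Distribution) = rand(d, 2)               # two draws of a vector or matrix

@show size(draw(Normal()))
@show size(draw(Dirichlet([0.1, 0.2, 0.3, 0.4])))
```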

In [20]:
# fitting a distribution, given some samples
fit_mle(Normal, randn(100_000)) # should get close to N(0, 1)

Distributions.Normal{Float64}(μ=0.00813106161701842, σ=0.9997411454596623)

In [21]:
# fit a Uniform by maximum likelihood (the estimates are the sample min and max)
fit_mle(Uniform, rand(100_000) .* 2 .+ 1) # should get close to U(1, 3)

Distributions.Uniform{Float64}(a=1.0000018376838407, b=2.9999855975667935)
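
For these two families the maximum likelihood estimates have closed forms, so we can check what `fit_mle` is doing. For a Normal the estimates are the sample mean and the uncorrected (divide by n) standard deviation; for a Uniform they are the sample minimum and maximum. The sketch below assumes Julia 0.7 or later (for `Random.seed!` and the `Statistics` standard library); the seed and sample sizes are arbitrary.

```julia
using Distributions, Statistics, Random

Random.seed!(1234)

# Normal: MLE is (sample mean, uncorrected sample standard deviation)
x = 3.0 .+ 2.0 .* randn(10_000)
mu_hat, sigma_hat = params(fit_mle(Normal, x))
@assert isapprox(mu_hat, mean(x))
@assert isapprox(sigma_hat, std(x; corrected = false))

# Uniform: MLE is (sample minimum, sample maximum)
u = 1.0 .+ 2.0 .* rand(10_000)
a_hat, b_hat = params(fit_mle(Uniform, u))
@assert isapprox(a_hat, minimum(u))
@assert isapprox(b_hat, maximum(u))
```

This also explains why the fitted Uniform above lands just inside the interval [1, 3]: the sample extremes can never fall outside the true support.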

## Calculus.jl

- Computes analytical derivatives of Julia `Expr`essions and accurate numerical derivatives of functions

In [22]:
# Pkg.add("Calculus")

### Calculus.jl Basics

In [23]:
using Calculus

[1m[34mINFO: Recompiling stale cache file /Users/sglyon/.julia/lib/v0.5/Calculus.ji for module Calculus.
[0m

#### Symbolic derivatives

In [24]:
differentiate(:(sin(x)), :x)

:(cos(x))

In [25]:
differentiate(:(cos(sin(y))), :y)

:(cos(y) * -(sin(sin(y))))

In [27]:
differentiate(:(c^(1-γ)/(1-γ)), :c)

:(((1 - γ) * c ^ ((1 - γ) - 1)) / (1 - γ))
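
The last result is not simplified, but algebraically it is just `c^(-γ)`. One way to convince yourself is to turn the returned expression into a function and evaluate it. This is a sketch: the name `marginal_utility` and the test point are my own, and `Base.invokelatest` is only there so the freshly `eval`ed function can be called safely from any context.

```julia
using Calculus

ex = differentiate(:(c^(1 - γ) / (1 - γ)), :c)

# Hypothetical helper: wrap the derivative expression in a function of (c, γ)
marginal_utility = eval(:((c, γ) -> $ex))

c0, γ0 = 2.0, 3.0
numeric = Base.invokelatest(marginal_utility, c0, γ0)

@show numeric
@show c0^(-γ0)
```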

#### Finite difference

In [29]:
derivative(sin, 1.0) - cos(1.0)

-5.036193684304635e-12

In [30]:
second_derivative(sin, 1.0) + sin(1.0)

-6.647716624952338e-7
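
The size of those two errors is typical of central differences. The sketch below is not Calculus.jl's implementation; it is a plain-Julia version of the textbook formulas with step sizes I picked (`cbrt(eps)` and `eps^(1/4)`, common rules of thumb), just to show why the second derivative is several orders of magnitude less accurate than the first.

```julia
# Hypothetical helpers, not Calculus.jl internals
# First derivative: (f(x + h) - f(x - h)) / 2h, error is O(h^2) plus rounding
central_diff(f, x; h = cbrt(eps(Float64))) = (f(x + h) - f(x - h)) / (2h)

# Second derivative: (f(x + h) - 2f(x) + f(x - h)) / h^2; dividing by h^2
# amplifies rounding error, so the achievable accuracy is lower
second_diff(f, x; h = eps(Float64)^(1 / 4)) = (f(x + h) - 2f(x) + f(x - h)) / h^2

@show central_diff(sin, 1.0) - cos(1.0)
@show second_diff(sin, 1.0) + sin(1.0)
```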

In [35]:
Calculus.gradient(x -> exp(x[1]) + sin(x[2]) / x[1], [1.0, π])

2-element Array{Float64,1}:
  2.71828
 -1.0    

In [41]:
Calculus.hessian(x -> exp(x[1]) + sin(x[2]) / x[1], [1.0, π])

2×2 Array{Float64,2}:
 2.71828  1.0       
 1.0      1.71123e-7

In [40]:
Calculus.jacobian(x -> [exp(x[1]),  sin(x[2]) / x[1]], [1.0, π], :central)

2×2 Array{Float64,2}:
  2.71828       0.0
 -1.22465e-16  -1.0
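
Since this function is simple enough to differentiate by hand, we can check the numerical gradient and Hessian against exact values. The helpers `grad_exact` and `hess_exact` are my own, and the tolerances are loose guesses based on the error sizes seen above.

```julia
using Calculus

f(x) = exp(x[1]) + sin(x[2]) / x[1]

# Hand-derived gradient of f
grad_exact(x) = [exp(x[1]) - sin(x[2]) / x[1]^2, cos(x[2]) / x[1]]

# Hand-derived (symmetric) Hessian of f
function hess_exact(x)
    h11 = exp(x[1]) + 2 * sin(x[2]) / x[1]^3
    h12 = -cos(x[2]) / x[1]^2
    h22 = -sin(x[2]) / x[1]
    return [h11 h12; h12 h22]
end

x0 = [1.0, π]
@show maximum(abs.(Calculus.gradient(f, x0) .- grad_exact(x0)))
@show maximum(abs.(Calculus.hessian(f, x0) .- hess_exact(x0)))
```

At `x0 = [1.0, π]` the exact gradient is `[e, -1]` and the exact Hessian is `[e 1; 1 0]` up to floating-point rounding of `sin(π)`, which matches the output above.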

## SymEngine.jl

- Julia wrapper for SymEngine, a next-generation C++ backend for the SymPy computer algebra system
- A very fast alternative to Calculus.jl for symbolic differentiation

In [28]:
# Pkg.add("SymEngine")

### SymEngine.jl Basics

In [42]:
using SymEngine

[1m[34mINFO: Recompiling stale cache file /Users/sglyon/.julia/lib/v0.5/SymEngine.ji for module SymEngine.
[0m

In [47]:
# needs first argument to be of type SymEngine.Basic
diff(Basic(:(sin(x))), :x)

cos(x)

In [49]:
diff(Basic("cos(sin(y))"), :y)

-cos(y)*sin(sin(y))

In [51]:
diff(Basic("c^(1-γ)/(1-γ)"), :c)

c^(-γ)
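
Instead of parsing strings, you can also build expressions from symbolic variables. The sketch below assumes a recent SymEngine.jl in which `@vars`, `subs`, and `N` are available as I remember them from the package README, so treat the exact calls as an assumption and check them against your installed version.

```julia
using SymEngine

# Declare symbolic variables (assumed API: the @vars macro)
@vars c γ

u = c^(1 - γ) / (1 - γ)
du = diff(u, c)
@show du

# Substitute numbers one variable at a time, then convert to a Julia number
val = N(subs(subs(du, c => 2.0), γ => 3.0))
@show val
@show 2.0^(-3.0)
```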

Let's see how fast SymEngine is compared to Calculus.jl

To do this we will load the BenchmarkTools.jl package, which goes to great lengths to produce statistically accurate and robust timing estimates at the sub-microsecond level.

In [58]:
# Pkg.add("BenchmarkTools")
using BenchmarkTools
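
If you haven't used BenchmarkTools before, here is a minimal sketch of the pattern. The array `x` and the `seconds = 1` time budget are my own choices; the one real rule to remember is to interpolate global variables with `$` so that you time the call itself rather than the global-variable lookup.

```julia
using BenchmarkTools

x = rand(1_000)

# @benchmark runs the expression many times and returns a Trial of samples
trial = @benchmark sum($x) seconds = 1

@show minimum(trial)   # fastest sample, usually the most robust estimate
@show median(trial)

# @belapsed returns just the minimum time, in seconds
t = @belapsed sum($x) seconds = 1
@show t
```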

[1m[34mINFO: Recompiling stale cache file /Users/sglyon/.julia/lib/v0.5/JLD.ji for module JLD.


In [65]:
@benchmark Calculus.differentiate(:((y + r*a - ap)^(1-γ)/(1-γ)), :ap)

BenchmarkTools.Trial: 
  memory estimate:  69.14 KiB
  allocs estimate:  685
  --------------
  minimum time:     540.176 μs (0.00% GC)
  median time:      574.756 μs (0.00% GC)
  mean time:        628.544 μs (1.40% GC)
  maximum time:     5.040 ms (79.05% GC)
  --------------
  samples:          7822
  evals/sample:     1

In [66]:
@benchmark diff(Basic("(y + r*a - ap)^(1-γ)/(1-γ)"), :ap)

BenchmarkTools.Trial: 
  memory estimate:  368 bytes
  allocs estimate:  10
  --------------
  minimum time:     49.156 μs (0.00% GC)
  median time:      53.615 μs (0.00% GC)
  mean time:        57.771 μs (0.00% GC)
  maximum time:     340.892 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
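
To put the two reports side by side, the arithmetic below just divides the numbers copied from the output above (named tuples need Julia 0.7 or later). On this machine SymEngine comes out roughly 11 times faster and uses a small fraction of the memory. Keep in mind that the SymEngine timing includes parsing the string, while the Calculus.jl call receives an already-parsed `Expr`, so if anything this understates the difference.

```julia
# Timings (μs) and memory (bytes) copied from the two benchmark reports above
calculus  = (min_time = 540.176, median_time = 574.756, memory_bytes = 69.14 * 1024)
symengine = (min_time = 49.156,  median_time = 53.615,  memory_bytes = 368.0)

speedup_min    = calculus.min_time / symengine.min_time
speedup_median = calculus.median_time / symengine.median_time
memory_ratio   = calculus.memory_bytes / symengine.memory_bytes

@show round(speedup_min, digits = 1)
@show round(speedup_median, digits = 1)
@show round(memory_ratio, digits = 0)
```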

## Data handling

- Julia's data ecosystem is young and still maturing
- Python is still my go-to choice for data cleaning/analysis
- That being said, working with data in Julia is still doable and effective

I won't demo them now, but some of the key packages are:

- [DataFrames.jl](https://github.com/JuliaStats/DataFrames.jl): Provides a DataFrame type for handling columnar data
- [CSV.jl](https://github.com/JuliaData/CSV.jl): very high performance reading and writing of delimited data files
- [DataStreams.jl](https://github.com/JuliaData/DataStreams.jl): provide an interface for streaming data from a source to a sink
- [Query.jl](https://github.com/davidanthoff/Query.jl): filter, project, join, group any iterable data source
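
To give a flavor of what working with tabular data looks like, here is a small DataFrames.jl sketch. It assumes a recent DataFrames.jl release (the API has changed a lot since these notes were written), and the column names and values are made up for illustration.

```julia
using DataFrames, Statistics

# A small made-up table
df = DataFrame(id = 1:4,
               group = ["a", "b", "a", "b"],
               value = [1.0, 2.5, 3.0, 4.5])

# Filter rows with a boolean mask
big = df[df.value .> 2, :]
@show nrow(big)

# Split-apply-combine: mean of `value` within each group
by_group = combine(groupby(df, :group), :value => mean => :mean_value)
@show by_group
```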