## Reading data from ROOT using UpROOT

_For flat data structures, it's very simple to read in data using uproot,
especially when that data can fit in memory. In this example we read in a ROOT file
and store the data in a DataFrame. The data is then split 80/20, with the first
80% used to form the predicted hypothesis/spectrum, and the last 20% used for validation._
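
As a minimal sketch of the 80/20 split described above, assuming the rows are split by position rather than at random: `split_indices` is a hypothetical helper introduced for illustration, not part of Batman, and a plain vector stands in for a DataFrame column.

```julia
# Hypothetical helper (not part of Batman): split row indices by position.
function split_indices(n::Integer; fraction::Real=0.8)
    0 <= fraction <= 1 || throw(ArgumentError("fraction must lie in [0, 1]"))
    nfirst = round(Int, fraction * n)
    return 1:nfirst, (nfirst + 1):n
end

energy = collect(1.0:10.0)                   # stand-in for one DataFrame column
model_rows, validation_rows = split_indices(length(energy))
model_energy = energy[model_rows]            # first 80%: builds the hypothesis
validation_energy = energy[validation_rows]  # last 20%: kept for validation
```

The same index ranges can be applied to a `DataFrame` with `df[model_rows, :]` and `df[validation_rows, :]`.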

### Readiness Checklist
_Once this checklist is completed, it can be removed and this module is complete._
- [x] Read in an arbitrary ROOT file (produced externally)
- [ ] Produce multiple 1D PDFs
- [ ] Produce a 2D PDF
- [ ] Produce a ... 4D PDF?
- [x] Mock dataset to fit to a 1D PDF
- [ ] Fit to the 2D distribution (extended likelihood)
- [ ] Produce uncertainties
- [ ] Bias
- [ ] Pull
- [ ] Are all of these dependencies necessary?
- [ ] Move required dependencies to Project.toml

In [None]:
push!(LOAD_PATH, "../src/")
using Batman
import Random: rand
using DataFrames
using Distributions
using StatsPlots; pyplot()
using PyPlot
import PyCall

import KernelDensity: kde, InterpKDE
import Interpolations; itp = Interpolations
using BenchmarkTools
import Distributions: Normal
import StatsBase: Histogram, fit
import LinearAlgebra: normalize

using KernelDensityEstimate
using KernelDensityEstimatePlotting

### Reading from ROOT
The data is read in with a simple utility, `DataStructures.rootreader`, which stores it internally in a `DataFrame`.

In [None]:
# The expected data is drawn from a bivariate distribution. Each variable is normal
# in 1D; however, the two variables "energy" and "position" are highly correlated.

# The ROOT files do not exist yet; however, we have a helper Python script to create
# them. This is not done in Julia because, at the moment, uproot writing does not work.
PyCall.py"""exec(open("assets/batroot.py").read())"""

signalMC = DataStructures.rootreader("assets/signal.root", "bat")
bkgMC = DataStructures.rootreader("assets/background.root", "bat")
marginalhist(signalMC.energy, signalMC.position, seriescolor=:viridis, bins=100)

### Mock data comparison
Using the Distributions package, we create our own bivariate dataset that is independent of the ROOT files.
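
For intuition about what sampling a correlated bivariate normal involves, here is a minimal standard-library sketch using a Cholesky factor of the covariance matrix. `sample_bivariate` is a hypothetical name introduced for illustration; the notebook itself relies on `MvNormal` from Distributions, and this is not a description of how that type is implemented.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: draw n samples from N(μ, Σ) as the columns of a matrix.
# If Σ = L * L' and z ~ N(0, I), then μ + L * z has mean μ and covariance Σ.
function sample_bivariate(rng, μ, Σ, n)
    L = cholesky(Symmetric(Σ)).L
    return μ .+ L * randn(rng, length(μ), n)
end

rng = MersenneTwister(1234)
Σ = [1.0 0.7; 0.7 1.0]     # same signal covariance as used in this notebook
μ = [2.0, 4.0]
x = sample_bivariate(rng, μ, Σ, 100_000)
ρ = cor(x[1, :], x[2, :])  # sample correlation, expected to be close to 0.7
```

Row 1 plays the role of "energy" and row 2 of "position", matching the layout returned by `rand(sigPDF, n)`.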

In [None]:
# Generate a mock dataset from two bivariate normal distributions.
signal_truth = 1000
bkg_truth = 350*0  # multiplying by 0 switches the background off
# Signal first
Σ = [[1.0,0.7] [0.7,1.0]]
μ = [2.0, 4.0]

sigPDF = MvNormal(μ, Σ)
# Now bkg
Σ = [[1.0,-0.7] [-0.7,1.0]]
μ = [3.5, 3.5]

bkgPDF = MvNormal(μ, Σ)
function mock_dataset()
    x_signal = rand(sigPDF, rand(Poisson(signal_truth)))
    x_bkg = rand(bkgPDF, rand(Poisson(bkg_truth)))

    x = hcat(x_bkg, x_signal)
    global data = DataFrame(energy=x[1,:], position=x[2,:])
    add_dataset(:data, data)
end
mock_dataset()
marginalhist(data.energy, data.position, seriescolor=:viridis, bins=100)

### PDF from Histogram
A PDF can be built from a dataset when the true underlying distribution is unknown but simulations are available to estimate it. Several approaches exist:
1. Kernel Density Estimate
2. Interpolation + Extrapolation
3. Neural Networks
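
To illustrate approach 2, here is a minimal standard-library sketch of a histogram-based PDF with flat interpolation inside each bin and zero extrapolation outside the binned range. `histogram_pdf` is a hypothetical function written for this example; it is an assumption about the general technique, not the implementation of Batman's `HistogramPDF`.

```julia
# Hypothetical sketch: build a normalised step-function PDF from samples.
# Bins are half-open, [edges[i], edges[i+1]); samples outside them are ignored.
function histogram_pdf(samples, edges)
    counts = zeros(length(edges) - 1)
    for s in samples
        i = searchsortedlast(edges, s)   # index of the last edge <= s
        if 1 <= i < length(edges)
            counts[i] += 1
        end
    end
    # Divide by total count and bin width so that the PDF integrates to 1.
    density = counts ./ (sum(counts) .* diff(edges))
    return x -> begin
        i = searchsortedlast(edges, x)
        (1 <= i < length(edges)) ? density[i] : 0.0   # zero outside the bins
    end
end

toy_pdf = histogram_pdf([0.5, 0.5, 1.5, 2.5], [0.0, 1.0, 2.0, 3.0])
```

Here `toy_pdf(0.5)` evaluates the density of the first bin, while `toy_pdf(-1.0)` and `toy_pdf(3.5)` fall outside the binned range and return zero.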

In [None]:
plt.hist(data.position, bins=100)

In [None]:
# 1: KDE
sig_kde = KdePDF(signalMC, :energy)
bkg_kde = KdePDF(bkgMC, :energy)

# 2: Flat interpolation + zero extrapolation
bins = collect(-2:0.1:8)
sig_itp = HistogramPDF(signalMC, :energy; bins=bins)
bkg_itp = HistogramPDF(bkgMC, :energy; bins=bins)

sx = collect(-5:0.01:10.0)
plt.plot(sx, sig_itp(sx), ".", label="Signal", color="xkcd:blue")
plt.plot(sx, sig_kde(sx), label="SignalKDE", color="xkcd:blue" )
plt.plot(sx, bkg_itp(sx), ".", label="Background", color="xkcd:green")
plt.plot(sx, bkg_kde(sx), label="BackgroundKDE", color="xkcd:green" )
plt.hist(signalMC.energy, density=true, bins=bins, label="Signal Hist", color="xkcd:aqua")
plt.legend()
plt.show()

In [None]:
# Sum vs integral: a Riemann sum of the PDF over a range covering its support
# should come out close to 1
dx = 0.1
x = collect(-10:dx:10)
integral = sum( sig_itp.(x)*dx )

### 1D component fit: Signal v Background
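
The cell below builds an extended unbinned likelihood. As a sketch of what its two `NLogPDF` terms add up to (the symbols $S$, $B$, $\lambda$ and $n$ are notation introduced here, not names from Batman): let $S(x)$ and $B(x)$ be the normalised signal and background shapes, $s$ and $b$ the fitted yields, $\lambda = s + b$ the expected total, and $n$ the number of observed events. Then

$$
-\ln \mathcal{L}(s, b)
= -\ln\left[\frac{e^{-\lambda}\lambda^{n}}{n!}\prod_{i=1}^{n}\frac{s\,S(x_i) + b\,B(x_i)}{\lambda}\right]
= -\sum_{i=1}^{n}\ln\bigl(s\,S(x_i) + b\,B(x_i)\bigr) + \lambda + \ln n!
$$

The sum corresponds to `spectralpdf`, and the remaining terms to `logextend`, where $\ln n!$ is replaced by the Stirling approximation $n\ln n - n$. Because $\ln n!$ does not depend on $s$ or $b$, it only shifts the NLL by a constant and does not move the minimum.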

In [None]:
# Put together a custom model to fit to the generated dataset

model = CustomModel()

## Free parameters
s = Parameter("signal"; init=1000.0)
b = Parameter("bkg"; init=100.0)

## Data counts
observation_count = Constant("obs", Float64( size(data.energy,1) ))

## Build the PDF functions
add_function(:sig_itp, sig_itp)
add_function(:bkg_itp, bkg_itp)

@addfunction bshape() = @. bkg_itp(data.energy)
@addfunction sshape() = @. sig_itp(data.energy)

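# Shape term of the NLL: -Σᵢ log(s*S(xᵢ) + b*B(xᵢ))
@addfunction spectralpdf(s, b) = begin
    -sum(log.(b*bshape() .+ s*sshape() ))
end
#@addfunction extended(x...) = begin
#    sum([x...])
#end
# Extended (Poisson) term: λ + log(n!), with log(n!) ≈ n*log(n) - n (Stirling)
@addfunction logextend(n, x...) = begin
    λ = sum([x...])
    λ + n*log(n) - n
end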

## Construct NLL
ob1 = NLogPDF("spectralpdf", s, b)
ob2 = NLogPDF("logextend", observation_count, s, b)

## Include the above in our model
add_parameters!(model, [s, b, observation_count])
add_nlogpdfs!(model, [ob1, ob2])
@show model.lower_bounds
println("Fitting")
options = Dict(
    "ftol_abs"=>0,
    "ftol_rel"=>1e-6,
)
results = minimize!(model; options=options)
#println("Profiling")
profile!("signal", results)

compute_profiled_uncertainties!(results; σ=1)
pretty_results(results)

In [None]:
## Plot spectra
p = getparam(model, "signal")
#signal_y = getparam(model, "signal").fit * sig_itp(bins) * (bins[2]-bins[1])
#bkg_y = getparam(model, "bkg").fit * bkg_itp(bins)* (bins[2]-bins[1])
# Scale each PDF by its fitted yield and the bin width to get expected counts per bin
signal_y = getparam(model, "signal").fit * sig_itp(sx) * (bins[2]-bins[1])
bkg_y = getparam(model, "bkg").fit * bkg_itp(sx) * (bins[2]-bins[1])
plt.plot(sx, signal_y+bkg_y, label="Total", color="black")
plt.plot(sx, signal_y, label="signal")
plt.plot(sx, bkg_y, label="bkg")
plt.hist(data.energy, bins=bins, label="Data")
plt.legend()

In [None]:
using PyPlot
plt.style.use("bat.mplstyle")

hs = x -> x >= 0 ? 1 : 0
#profile!("Signal", results; prior=nothing)
#uncertainty!("Signal", results )

interval_plot(results, "signal")
plt.savefig("profile.svg")
plt.show()

In [None]:
using PyPlot
correlation_plots(results)
plt.show()

In [None]:
correlation_plots2(results)

In [None]:
## Bias/Pull Testing

bias_vector = Float64[]
pull_vector = Float64[]
trials = collect(1:10)

errors = 0
for t in trials
    try
        mock_dataset()
        results = minimize!(model; options=options)
        profile!("signal", results)
        uncertainty!("signal", results;σ=1)
        sig_stats = getparam(model, "signal")
        bias = sig_stats.fit - signal_truth
        if bias >= 0
            pull = bias / abs(sig_stats.fit - sig_stats.low)
        else
            pull = bias / abs(sig_stats.fit - sig_stats.high)
        end
        push!(bias_vector, bias)
        push!(pull_vector, pull)
    catch e
        errors += 1
    end
    print(t,"\r")
end
println("Failure rate: ", errors/length(trials))

In [None]:
# Named bias_mean to avoid shadowing the `mean` function exported by Distributions
@show bias_mean = sum(bias_vector) / length(bias_vector)
plt.hist(bias_vector, bins=20);

In [None]:
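# Named pull_mean/pull_dev to avoid shadowing the `mean` function exported by Distributions
@show pull_mean = sum(pull_vector)/length(pull_vector)
@show pull_dev = sqrt(sum((pull_vector.-pull_mean).^2)/length(pull_vector))

plt.hist(pull_vector, bins=collect(-3:0.1:3))

println("Pull distribution: ", pull_mean, " +- ", pull_dev);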