
[WIP] Add population analysis with error bars across population for continuous plots #30

Merged: 25 commits, Jan 18, 2017

Commits
89d83f2
Added Bar Plots and Scatter Plots
Dec 1, 2016
9bcef97
Merge branch 'master' of https://github.com/JuliaPlots/StatPlots.jl
Dec 2, 2016
85b0f40
Added keyword version, also with cyclic keywords and automated label
Dec 9, 2016
203120c
Added set of standard analysis
Dec 9, 2016
da11d97
Changed NaN with NAs
Dec 10, 2016
dce15d2
Fixed bug on sorting categorical x axis
Dec 10, 2016
1e6921b
Fixed bug on sorting categorical x axis
Dec 10, 2016
4c40585
Added shortcuts, replaced kernelestimator with loess
Dec 15, 2016
5a8319b
Merge branch 'master' of https://github.com/piever/StatPlots.jl
Dec 15, 2016
4c18e6a
trying groupapply
Dec 17, 2016
c972da8
Added new groupederror type
Dec 18, 2016
3f51641
Fixed groupedbar on plotlyjs, added description to readme
Dec 19, 2016
29bd4f3
corrected couple of mistakes in readme
Dec 19, 2016
4764641
Fixed issue with label in Plotlyjs
Dec 19, 2016
fd10f37
Cleaned up axis selection, added docstrings
Dec 26, 2016
e7c4280
Updated readme
Dec 26, 2016
b06ae4b
Fixed error in case of subject with one datapoint!
Dec 31, 2016
2d10288
removed unnecessary array conversion
Jan 1, 2017
1a874c6
Unified files into groupederror, updated readme
Jan 10, 2017
899c87a
Implemented bootstrap error
Jan 10, 2017
41a2251
Updated README and docs
Jan 10, 2017
dc7e1d9
Corrected typos
Jan 10, 2017
4d65fb7
Updated groupapply docstrings
Jan 12, 2017
ff638a5
Temporary: minor refactoring of the data
Jan 16, 2017
3ae3845
Renamed get_summary to get_groupederror and corrected docstrings
Jan 17, 2017
44 changes: 44 additions & 0 deletions README.md
@@ -121,3 +121,47 @@ groupedbar(rand(10,3), bar_position = :dodge, bar_width=0.7)
```

![tmp](https://cloud.githubusercontent.com/assets/933338/18962092/673f6c78-863d-11e6-9ee9-8ca104e5d2a3.png)


## groupapply for population analysis
There is a `groupapply` function that splits the data according to the keyword argument `group`, then applies the pair of functions given by the keyword `summarize` to get the average and variability of a given analysis (density, cumulative distribution and local regression are supported so far, but one can also add one's own function). There are three ways to get average and variability:

- `compute_error = (:across, col_name)`, where the data is split according to column `col_name` before being summarized. `compute_error = :across` splits across all observations. Default summary is `(mean, sem)` but it can be changed with keyword `summarize` to any pair of functions.

- `compute_error = (:bootstrap, n_samples)`, where `n_samples` fake datasets are generated using nonparametric bootstrapping and then summarized. `compute_error = :bootstrap` defaults to `compute_error = (:bootstrap, 1000)`. Default summary is `(mean, std)`. This method will work with any analysis but is computationally very expensive.

- `compute_error = :none`, where no error is computed or displayed and the analysis is carried out normally.

The local regression uses [Loess.jl](https://github.com/JuliaStats/Loess.jl) and the density plot uses [KernelDensity.jl](https://github.com/JuliaStats/KernelDensity.jl). In the case of a categorical x variable, these functions are computed by splitting the data across the x variable and then computing the density/average per bin. The choice of continuous or discrete axis can be forced via `axis_type = :continuous` or `axis_type = :discrete`.
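To illustrate the keywords described above, here is a minimal sketch combining a custom `summarize` pair with a forced axis type. The toy dataframe and its columns (`x`, `y`, `id`) are made up for illustration, not taken from the PR's examples; `median`/`std` stand in for any trend/variation pair:

```julia
using DataFrames, StatPlots

# hypothetical toy data: a continuous x, a response y, and a subject id
df = DataFrame(x = rand(200), y = rand(200), id = rand(1:10, 200))

# local regression, variability computed across subjects, custom summary pair,
# axis explicitly forced to be continuous
grp = groupapply(:locreg, df, :x, :y;
                 compute_error = (:across, :id),
                 summarize = (median, std),
                 axis_type = :continuous)
plot(grp, line = :path)
```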

Example use:

```julia
using DataFrames
import RDatasets
using StatPlots
gr()
school = RDatasets.dataset("mlmRev","Hsb82");
grp_error = groupapply(:cumulative, school, :MAch; compute_error = (:across,:School), group = :Sx)
plot(grp_error, line = :path)
```
<img width="494" alt="screenshot 2016-12-19 12 28 27" src="https://cloud.githubusercontent.com/assets/6333339/21313005/316e0f0c-c5e7-11e6-9464-f0921dee3d29.png">

Keyword arguments for Loess or KernelDensity can be passed to `groupapply`:

```julia
df = groupapply(:density, school, :CSES; bandwidth = 1., compute_error = (:bootstrap,500), group = :Minrty)
plot(df, line = :path)
```

<img width="487" alt="screenshot 2017-01-10 18 36 48" src="https://cloud.githubusercontent.com/assets/6333339/21819500/cb788fb8-d763-11e6-89b9-91018f2b9a2a.png">


The bar plot:

```julia
pool!(school, :Sx)
grp_error = groupapply(school, :Sx, :MAch; compute_error = :across, group = :Minrty)
plot(grp_error, line = :bar)
```
<img width="489" alt="screenshot 2017-01-10 18 20 51" src="https://cloud.githubusercontent.com/assets/6333339/21819413/7923681e-d763-11e6-907d-c81447b4cc99.png">
1 change: 1 addition & 0 deletions REQUIRE
@@ -6,3 +6,4 @@ StatsBase
Distributions
DataFrames
KernelDensity
Loess
Member:
Adding another package here is a bit of a debate. I would say that StatPlots should not be afraid of including statistical packages (I guess that is one reason to keep it separate from Plots), and I also think it should depend on GLM. What is your philosophy here, @tbreloff ?

Member:
Sorry I missed this comment. Yes, adding dependencies should not be done lightly, but this is probably acceptable.

7 changes: 7 additions & 0 deletions src/StatPlots.jl
@@ -9,11 +9,15 @@ using Distributions
using DataFrames

import KernelDensity
import Loess
@recipe f(k::KernelDensity.UnivariateKDE) = k.x, k.density
@recipe f(k::KernelDensity.BivariateKDE) = k.x, k.y, k.density

@shorthands cdensity

export groupapply
export get_summary
Member:
do we need to export get_summary?

Member Author:
I don't have strong opinions on that. The actual plotting is done as in the README:

```julia
grp_error = groupapply(:cumulative, school, :MAch; compute_error = (:across,:School), group = :Sx)
plot(grp_error, line = :path)
```

and in practice I never really use it. Is it bad to export it?

Member:
In this case it doesn't seem necessary, but I admit I don't actually know when you'd call it.

Member Author:
It was proposed by @mkborregaard. The idea is that in the normal scenario you would have a dataframe that you'd split according to a column and then summarize the split data (for example getting mean and sem). The use case would be somebody who has already obtained a grouped dataframe by some other means and wants to get the summary, which is not possible with only groupapply.
The pipeline is: groupapply calls get_summary, which calls the analysis function on the subdataframes. groupapply, get_summary and the built-in analysis functions have docstrings. The docstring of groupapply refers to get_summary, so maybe the user would expect it to be exported.
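The direct-call scenario described here might look roughly like the following sketch. This is hedged: it assumes the internal signature `get_summary(trend, variation, f, splitdata::GroupedDataFrame, xvalues, args...)` from this PR's `src/groupederror.jl`, reuses the `school` dataframe from the README example, and `StatPlots._cumulative` is one of the PR's (non-exported) built-in analysis functions:

```julia
# data already grouped by some other means
splitdata = groupby(school, :School)
# shared x axis for the continuous case, as get_axis does in the PR
xvalues = linspace(minimum(school[:MAch]), maximum(school[:MAch]), 100)
# summarize the per-group cumulative distributions with mean and sem
x, m, s = get_summary(mean, sem, StatPlots._cumulative, splitdata, xvalues, :MAch)
```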

Member Author:
Actually, on second thought I'm in favour of keeping it, also for performance reasons: for large datasets it may be inconvenient to have to group them every time. In the implementation of compute_error = :bootstrap I see a sizeable performance improvement from working with GroupedDataFrames directly rather than putting everything in a big DataFrame and then applying groupby.

Member:
Yes, exactly, that was the idea. I think it should be renamed to something more specific: "get_summary" is too generic a name to put in the user's workspace for a very specific function used for only one type of plot.


include("dataframes.jl")
include("corrplot.jl")
include("cornerplot.jl")
@@ -24,5 +28,8 @@ include("hist.jl")
include("marginalhist.jl")
include("bar.jl")
include("shadederror.jl")
include("groupederror.jl")


Member:
you should export the groupapply function


end # module
2 changes: 1 addition & 1 deletion src/bar.jl
@@ -53,7 +53,7 @@ grouped_xy(y::AbstractMatrix) = 1:size(y,1), y
end
fr
else
get(d, :fillrange, 0)
get(d, :fillrange, nothing)
Member:
remind me why you are throwing away the fillrange value here for non-stacked bars?

Member Author:
It's a fix for #32: in the case of error bars, the fillrange of 0 also gets applied to the error series and creates the messy artifacts shown in the issue (on PlotlyJS). Without any fillrange it seems to work just fine.

Member:
Hmm, I worry about what unintended consequences that may have. Does `get(d, :fillrange, nothing)` work?

Member Author:
Yes, that also works. I'll go for get(d, :fillrange, nothing) to stay as close as possible to the original design then.
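For reference, `get` on the attribute dictionary simply falls back to the supplied default when the key is absent, so with `nothing` as the default no fillrange is applied at all. A minimal illustration (the dictionary here is made up, standing in for the recipe's `d`):

```julia
# stand-in for the plot attribute dictionary `d` inside the recipe
d = Dict{Symbol,Any}(:seriestype => :bar)

# key absent: the default is returned, so no fillrange gets set
fr = get(d, :fillrange, nothing)
```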

end

seriestype := :bar
308 changes: 308 additions & 0 deletions src/groupederror.jl
@@ -0,0 +1,308 @@
##### List of functions to analyze data

function extend_axis(df::AbstractDataFrame, xlabel, ylabel, xaxis, val)
aux = DataFrame()
aux[xlabel] = xaxis
extended = join(aux, df, on = xlabel, kind = :left)
sort!(extended, cols = [xlabel])
return convert(Array, extended[ylabel], val)
end

function extend_axis(xsmall, ysmall, xaxis, val)
df = DataFrame(x = xsmall, y = ysmall)
return extend_axis(df, :x, :y, xaxis, val)
end

"""
`_locreg(df, xaxis::LinSpace, x, y; kwargs...)`

Apply loess regression, training the regressor with `x` and `y` and
predicting `xaxis`
"""
function _locreg(df, xaxis::LinSpace, x, y; kwargs...)
predicted = fill(NaN,length(xaxis))
within = minimum(df[x]).<= xaxis .<= maximum(df[x])
if any(within)
model = Loess.loess(convert(Vector{Float64},df[x]),convert(Vector{Float64},df[y]); kwargs...)
predicted[within] = Loess.predict(model,xaxis[within])
end
return predicted
end


"""
`_locreg(df, xaxis, x, y; kwargs...)`

In the discrete case, the function computes the conditional expectation of `y` for
a given value of `x`
"""
function _locreg(df, xaxis, x, y)
ymean = by(df, x) do dd
DataFrame(m = mean(dd[y]))
end
return extend_axis(ymean, x, :m, xaxis, NaN)
end

"""
`_density(df,xaxis::LinSpace, x; kwargs...)`

Kernel density of `x`, computed along `xaxis`
"""
_density(df,xaxis::LinSpace, x; kwargs...) = pdf(KernelDensity.kde(df[x]; kwargs...),xaxis)

"""
`_density(df, xaxis, x)`

Normalized histogram of `x` (which is discrete: every value is its own bin)
"""
function _density(df,xaxis, x)
xhist = by(df, x) do dd
DataFrame(length = size(dd,1)/size(df,1))
end
return extend_axis(xhist, x, :length, xaxis, 0.)
end

"""
`_cumulative(df, xaxis, x) = ecdf(df[x])(xaxis)`

Cumulative density function of `x`, computed along `xaxis`
"""
_cumulative(df, xaxis, x) = ecdf(df[x])(xaxis)


#### Method to compute and plot grouped error plots using the above functions

type GroupedError{S, T<:AbstractString}
x::Vector{Vector{S}}
y::Vector{Vector{Float64}}
err::Vector{Vector{Float64}}
group::Vector{T}
axis_type::Symbol
show_error::Bool
end

get_axis(column) = sort!(union(column))
get_axis(column, npoints) = linspace(minimum(column),maximum(column),npoints)

# f is the function used to analyze dataset: define it as nan when it is not defined,
# the input is: dataframe used, points chosen on the x axis, x (and maybe y) column labels
# the output is the y value for the given xvalues

errortype(s::Symbol) = s
errortype(s) = s[1]

function newsymbol(s, l::AbstractArray{Symbol})
ns = s
while ns in l
ns = Symbol(ns,:_)
end
return ns
end

newsymbol(s, df::AbstractDataFrame) = newsymbol(s, names(df))


"""
get_summary(trend,variation, f, splitdata::GroupedDataFrame, xvalues, args...; kwargs...)

Apply function `f` to `splitdata`, then compute summary statistics
`trend` and `variation` of those values. A shared x axis `xvalues` is needed: use
`LinSpace` for continuous x variable and a normal vector for the discrete case. Remaining arguments
are label of x axis variable and extra arguments for function `f`. `kwargs...` are passed
to `f`
"""
function get_summary(trend,variation, f, splitdata::GroupedDataFrame, xvalues, args...; kwargs...)
v = Array(Float64, length(xvalues), length(splitdata));
for i in 1:length(splitdata)
v[:,i] = f(splitdata[i],xvalues, args...; kwargs...)
end
mean_across_pop = Array(Float64, length(xvalues));
sem_across_pop = Array(Float64, length(xvalues));
for j in 1:length(xvalues)
notnan = !isnan(v[j,:])
mean_across_pop[j] = trend(v[j,notnan])
sem_across_pop[j] = variation(v[j,notnan])
end
valid = !isnan(mean_across_pop) & !isnan(sem_across_pop)
return xvalues[valid], mean_across_pop[valid], sem_across_pop[valid]
end

"""
get_summary(trend,variation, f, df::AbstractDataFrame, ce, axis_type, args...; kwargs...)

Get a GroupedDataFrame from `df` according to `ce`. `ce = (:across, col_name)` will split
across column `col_name`, whereas `ce = (:bootstrap, n_samples)` will generate `n_samples`
fake datasets distributed like the real dataset (nonparametric bootstrapping). Choose the shared axis
according to `axis_type` (`:continuous` or `:discrete`), then compute `get_summary`
"""
function get_summary(trend,variation, f, df::AbstractDataFrame, ce, axis_type, args...; kwargs...)
# define points on x axis
Member:
better xaxis = :auto

if axis_type == :discrete
xvalues = get_axis(df[args[1]])
elseif axis_type == :continuous
xvalues = get_axis(df[args[1]], 100)
else
error("Unexpected axis_type: only :discrete and :continuous allowed!")
end

if ce == :none
mean_across_pop = f(df,xvalues, args...; kwargs...)
sem_across_pop = zeros(length(xvalues));
valid = ~isnan(mean_across_pop)
Member:
why ~ not !?

Member Author:
Oops, apparently programming in Matlab left a mark on me. I guess the two are equivalent, but I'll be consistent with the rest of the world and use `!`.

return xvalues[valid], mean_across_pop[valid], sem_across_pop[valid]
elseif ce[1] == :across
# get mean value and sem of function of interest
splitdata = groupby(df, ce[2])
return get_summary(trend,variation, f, splitdata, xvalues, args...; kwargs...)
elseif ce[1] == :bootstrap
n_samples = ce[2]
indexes = Array(Int64,0)
split_var = Array(Int64,0)
for i = 1:n_samples
append!(indexes, rand(1:size(df,1),size(df,1)))
append!(split_var, fill(i,size(df,1)))
end
Member:
This way of dealing with NaN means that confidence bands will all taper off towards the ends, right? I think that is valid enough, but might be more conservative to skip those x values with NaNs completely

Member Author:
Actually, with real data these plots tend to have huge error bars towards the ends, which is somewhat annoying, but it could also be that one has a lousy subject with a weird range, and you don't want that to prevent you from looking at most of your data. As of now I require at least 2 subjects to be non-NaN in order to plot the mean and sem, but that could become another keyword (the number or fraction of subjects required to consider a point "plottable"). In principle one could even show, with vertical lines or something similar, where the majority of the data lies (say from the 0.05 to the 0.95 quantile).

Member:
No I think it is fine to leave it as is.

split_col = newsymbol(:split_col, df)
bootstrap_data = df[indexes,:]
bootstrap_data[split_col] = split_var
ends = collect(size(df,1)*(1:n_samples))
starts = ends - size(df,1) + 1
splitdata = GroupedDataFrame(bootstrap_data,[split_col],collect(1:ends[end]), starts, ends)
return get_summary(trend,variation, f, splitdata, xvalues, args...; kwargs...)
else
error("compute_error can only be equal to :none, :across,
(:across, col_name), :bootstrap or (:bootstrap, n_samples)")
end
end

"""
groupapply(f::Function, df, args...; axis_type = :auto, across = [], group = [], summarize = (mean, sem), kwargs...)

Split `df` by `group`. Then apply `get_summary` to get a population summary of the grouped data.
Output is a `GroupedError` with error computed according to the keyword `compute_error`.
It can be plotted using `plot(g::GroupedError)`
Seriestype can be specified to be `:path`, `:scatter` or `:bar`
Member:
There are a few typos here. Also, the summarize = (mean, sem) argument seems to be at odds with "Then apply get_mean_sem", as that would change with different summarize, right?

"""
function groupapply(f::Function, df, args...;
Member:
docstring please

axis_type = :auto, compute_error = :none, group = [],
summarize = (errortype(compute_error) == :bootstrap) ? (mean, std) : (mean, sem), kwargs...)
if !(axis_type in [:discrete, :continuous])
axis_type = (typeof(df[args[1]])<:PooledDataArray) ? :discrete : :continuous
end
if (axis_type == :continuous) & !(eltype(df[args[1]])<:Real)
warn("Changing to discrete axis, x values are not real numbers!")
axis_type = :discrete
end
mutated_xtype = (axis_type == :continuous) ? Float64 : eltype(df[args[1]])

# Add default for :across and :bootstrap
if compute_error == :across
row_name = newsymbol(:rows, df)
df[row_name] = 1:size(df,1)
ce = (:across, row_name)
elseif compute_error == :bootstrap
ce = (:bootstrap, 1000)
else
ce = compute_error
end

g = GroupedError(
Array(Vector{mutated_xtype},0),
Array(Vector{Float64},0),
Array(Vector{Float64},0),
Array(AbstractString,0),
axis_type,
ce != :none
)
if group == []
xvalues,yvalues,shade = get_summary(summarize..., f, df, ce, axis_type, args...; kwargs...)
push!(g.x, xvalues)
push!(g.y, yvalues)
push!(g.err, shade)
push!(g.group, "")
else
#group_array = isa(group, AbstractArray) ? group : [group]
by(df,group) do dd
label = isa(group, AbstractArray) ?
string(["$(dd[1,column]) " for column in group]...) : string(dd[1,group])
xvalues,yvalues,shade = get_summary(summarize...,f, dd, ce, axis_type, args...; kwargs...)
push!(g.x, xvalues)
push!(g.y, yvalues)
push!(g.err, shade)
push!(g.group, label)
return
end
end
if compute_error == :across; delete!(df, row_name); end

return g
end

builtin_funcs = Dict(zip([:locreg, :density, :cumulative], [_locreg, _density, _cumulative]))

"""
groupapply(s::Symbol, df, args...; kwargs...)

`s` can be `:locreg`, `:density` or `:cumulative`, in which case the corresponding built in
analysis function is used. `s` can also be a symbol of a column of `df`, in which case the call
is equivalent to `groupapply(:locreg, df, args[1], s; kwargs...)`
"""
function groupapply(s::Symbol, df, args...; kwargs...)
if s in keys(builtin_funcs)
analysisfunction = builtin_funcs[s]
return groupapply(analysisfunction, df, args...; kwargs...)
else
return groupapply(_locreg, df, args[1], s; kwargs...)
end
end

"""
groupapply(df::AbstractDataFrame, x, y; kwargs...)

Equivalent to `groupapply(:locreg, df::AbstractDataFrame, x, y; kwargs...)`
"""

groupapply(df::AbstractDataFrame, x, y; kwargs...) = groupapply(_locreg, df, x, y; kwargs...)

@recipe function f(g::GroupedError)
if !(:seriestype in keys(d)) || (d[:seriestype] == :path)
for i = 1:length(g.group)
@series begin
seriestype := :shadederror
x := cycle(g.x,i)
y := cycle(g.y, i)
shade := cycle(g.err,i)
label --> cycle(g.group,i)
()
end
end
elseif d[:seriestype] == :scatter
for i = 1:length(g.group)
@series begin
seriestype := :scatter
x := cycle(g.x,i)
y := cycle(g.y, i)
if g.show_error
err := cycle(g.err,i)
end
label --> cycle(g.group,i)
()
end
end
elseif d[:seriestype] == :bar
if g.axis_type == :continuous
warn("Bar plot with continuous x axis doesn't make sense!")
end
xaxis = sort!(union((g.x)...))
ys = extend_axis.(g.x, g.y, [xaxis], [NaN])
y = hcat(ys...)
if g.show_error
errs = extend_axis.(g.x, g.err, [xaxis], [NaN])
err := hcat(errs...)
end
label --> hcat(g.group...)
StatPlots.GroupedBar((xaxis,y))
end
end

@shorthands GroupedError