[WIP] Add population analysis with error bars across population for continuous plots #30
Changes from 21 commits
@@ -6,3 +6,4 @@ StatsBase
Distributions
DataFrames
KernelDensity
Loess
@@ -9,11 +9,15 @@ using Distributions
using DataFrames

import KernelDensity
import Loess
@recipe f(k::KernelDensity.UnivariateKDE) = k.x, k.density
@recipe f(k::KernelDensity.BivariateKDE) = k.x, k.y, k.density

@shorthands cdensity

export groupapply
export get_summary

Comment: do we need to export

Reply: I don't have strong opinions on that. The actual plotting is done as in the README:

    grp_error = groupapply(:cumulative, school, :MAch; compute_error = (:across, :School), group = :Sx)
    plot(grp_error, line = :path)

and in practice I never really use it. Is it bad to export it?

Comment: In this case it doesn't seem necessary, but I admit I don't actually know when you'd call it.

Reply: It was proposed by @mkborregaard. The idea is that in the normal scenario, you would have a dataframe that you'd split according to a column and then summarize the split data (for example getting the mean and sem). The use case would be somebody having already obtained a grouped dataframe by some other means and wanting to get the summary, which is not possible with only

Reply: Actually, on second thought I'm in favour of keeping it, also for performance reasons: for large datasets it may be inconvenient to have to group the data every time. In the implementation of

Comment: Yes exactly, that was the idea. I think it should be renamed to something more specific - "get_summary" is too generic a name to put in the user's workspace for a very specific function used for one type of plot only.

include("dataframes.jl")
include("corrplot.jl")
include("cornerplot.jl")

@@ -24,5 +28,8 @@ include("hist.jl")
include("marginalhist.jl")
include("bar.jl")
include("shadederror.jl")
include("groupederror.jl")

Comment: you should export the groupapply function

end # module
@@ -53,7 +53,7 @@ grouped_xy(y::AbstractMatrix) = 1:size(y,1), y
        end
        fr
    else
-       get(d, :fillrange, 0)
+       get(d, :fillrange, nothing)
    end

Comment: remind me why you are throwing away the fillrange value here for non-stacked bars?

Reply: It's a fix for #32. With error bars, the fillrange of 0 also gets applied to the error series and creates the messy artifacts shown in the issue (on PlotlyJS). Without any fillrange it seems to work just fine.

Comment: Hmm, I worry about what unintended consequences that may have. Does

Reply: Yes, that also works. I'll go for

    seriestype := :bar
@@ -0,0 +1,308 @@
##### List of functions to analyze data

function extend_axis(df::AbstractDataFrame, xlabel, ylabel, xaxis, val)
    aux = DataFrame()
    aux[xlabel] = xaxis
    extended = join(aux, df, on = xlabel, kind = :left)
    sort!(extended, cols = [xlabel])
    return convert(Array, extended[ylabel], val)
end

function extend_axis(xsmall, ysmall, xaxis, val)
    df = DataFrame(x = xsmall, y = ysmall)
    return extend_axis(df, :x, :y, xaxis, val)
end
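The DataFrames-based `extend_axis` above left-joins a short series onto the full axis and fills the gaps with a default value. A minimal dependency-free sketch of the same idea (the name `extend_axis_demo` is made up for illustration):

```julia
# Hypothetical standalone analog of extend_axis: align a short (x, y) series
# onto a full x axis, filling positions with no data with `val`.
function extend_axis_demo(xsmall, ysmall, xaxis, val)
    lookup = Dict(zip(xsmall, ysmall))           # x => y for the points we have
    return [get(lookup, x, val) for x in xaxis]  # fill the rest with `val`
end

extend_axis_demo([1, 3], [10.0, 30.0], 1:4, NaN)  # → [10.0, NaN, 30.0, NaN]
```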
"""
`_locreg(df, xaxis::LinSpace, x, y; kwargs...)`

Apply loess regression, training the regressor with `x` and `y` and
predicting `xaxis`.
"""
function _locreg(df, xaxis::LinSpace, x, y; kwargs...)
    predicted = fill(NaN, length(xaxis))
    within = minimum(df[x]) .<= xaxis .<= maximum(df[x])
    if any(within)
        model = Loess.loess(convert(Vector{Float64}, df[x]), convert(Vector{Float64}, df[y]); kwargs...)
        predicted[within] = Loess.predict(model, xaxis[within])
    end
    return predicted
end

"""
`_locreg(df, xaxis, x, y)`

In the discrete case, the function computes the conditional expectation of `y` for
a given value of `x`.
"""
function _locreg(df, xaxis, x, y)
    ymean = by(df, x) do dd
        DataFrame(m = mean(dd[y]))
    end
    return extend_axis(ymean, x, :m, xaxis, NaN)
end
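The discrete `_locreg` method is just a grouped conditional mean. A tiny sketch of the computation without DataFrames (variable names invented for illustration):

```julia
# Conditional expectation of y given x, for discrete x:
# group the y values by their x value and average each group.
x = [1, 1, 2, 2, 2]
y = [2.0, 4.0, 1.0, 2.0, 3.0]
condmean = Dict(k => sum(y[x .== k]) / count(x .== k) for k in unique(x))

condmean[1]  # → 3.0
condmean[2]  # → 2.0
```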
"""
`_density(df, xaxis::LinSpace, x; kwargs...)`

Kernel density of `x`, computed along `xaxis`.
"""
_density(df, xaxis::LinSpace, x; kwargs...) = pdf(KernelDensity.kde(df[x]; kwargs...), xaxis)

"""
`_density(df, xaxis, x)`

Normalized histogram of `x` (which is discrete: every value is its own bin).
"""
function _density(df, xaxis, x)
    xhist = by(df, x) do dd
        DataFrame(length = size(dd,1)/size(df,1))
    end
    return extend_axis(xhist, x, :length, xaxis, 0.)
end

"""
`_cumulative(df, xaxis, x) = ecdf(df[x])(xaxis)`

Cumulative density function of `x`, computed along `xaxis`.
"""
_cumulative(df, xaxis, x) = ecdf(df[x])(xaxis)
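`_cumulative` simply evaluates the empirical CDF returned by StatsBase's `ecdf` on the shared axis. The underlying computation, sketched by hand (`myecdf` is an illustrative name, not part of the package):

```julia
# Hand-rolled empirical CDF (what ecdf(data)(t) computes):
# the fraction of observations less than or equal to t.
myecdf(data, t) = count(<=(t), data) / length(data)

data = [1.0, 2.0, 2.0, 4.0]
[myecdf(data, t) for t in 0:4]  # → [0.0, 0.25, 0.75, 0.75, 1.0]
```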
#### Method to compute and plot grouped error plots using the above functions

type GroupedError{S, T<:AbstractString}
    x::Vector{Vector{S}}
    y::Vector{Vector{Float64}}
    err::Vector{Vector{Float64}}
    group::Vector{T}
    axis_type::Symbol
    show_error::Bool
end

get_axis(column) = sort!(union(column))
get_axis(column, npoints) = linspace(minimum(column), maximum(column), npoints)

# f is the function used to analyze the dataset: it should return NaN where it is
# not defined. Its input is the dataframe used, the points chosen on the x axis,
# and the x (and possibly y) column labels; its output is the y value for the
# given x values.

errortype(s::Symbol) = s
errortype(s) = s[1]

function newsymbol(s, l::AbstractArray{Symbol})
    ns = s
    while ns in l
        ns = Symbol(ns, :_)
    end
    return ns
end

newsymbol(s, df::AbstractDataFrame) = newsymbol(s, names(df))
"""
`get_summary(trend, variation, f, splitdata::GroupedDataFrame, xvalues, args...; kwargs...)`

Apply function `f` to each group of `splitdata`, then compute the summary statistics
`trend` and `variation` of those values. A shared x axis `xvalues` is needed: use a
`LinSpace` for a continuous x variable and a normal vector in the discrete case. The
remaining arguments are the label of the x-axis variable and extra arguments for
function `f`. `kwargs...` are passed to `f`.
"""
function get_summary(trend, variation, f, splitdata::GroupedDataFrame, xvalues, args...; kwargs...)
    v = Array(Float64, length(xvalues), length(splitdata));
    for i in 1:length(splitdata)
        v[:,i] = f(splitdata[i], xvalues, args...; kwargs...)
    end
    mean_across_pop = Array(Float64, length(xvalues));
    sem_across_pop = Array(Float64, length(xvalues));
    for j in 1:length(xvalues)
        notnan = !isnan(v[j,:])
        mean_across_pop[j] = trend(v[j,notnan])
        sem_across_pop[j] = variation(v[j,notnan])
    end
    valid = !isnan(mean_across_pop) & !isnan(sem_across_pop)
    return xvalues[valid], mean_across_pop[valid], sem_across_pop[valid]
end
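At each shared x point, `get_summary` drops the groups whose analysis returned NaN there and summarizes the rest. A sketch of that per-point step in current Julia (the name `summarize_point` is invented; `variation` defaults to the standard error of the mean, std/√n, matching the `sem` default used in the PR):

```julia
using Statistics

# Summarize one x point across groups, ignoring NaN entries
# (groups whose data didn't cover this x value).
function summarize_point(vals; trend = mean, variation = v -> std(v) / sqrt(length(v)))
    ok = .!isnan.(vals)
    return trend(vals[ok]), variation(vals[ok])
end

summarize_point([1.0, NaN, 3.0])  # → (2.0, 1.0)
```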
"""
`get_summary(trend, variation, f, df::AbstractDataFrame, ce, axis_type, args...; kwargs...)`

Get a GroupedDataFrame from `df` according to `ce`. `ce = (:across, col_name)` will split
across column `col_name`, whereas `ce = (:bootstrap, n_samples)` will generate `n_samples`
fake datasets distributed like the real dataset (nonparametric bootstrapping). Choose the
shared axis according to `axis_type` (`:continuous` or `:discrete`), then compute `get_summary`.
"""
function get_summary(trend, variation, f, df::AbstractDataFrame, ce, axis_type, args...; kwargs...)
    # define points on x axis

Comment: better

    if axis_type == :discrete
        xvalues = get_axis(df[args[1]])
    elseif axis_type == :continuous
        xvalues = get_axis(df[args[1]], 100)
    else
        error("Unexpected axis_type: only :discrete and :continuous allowed!")
    end

    if ce == :none
        mean_across_pop = f(df, xvalues, args...; kwargs...)
        sem_across_pop = zeros(length(xvalues));
        valid = ~isnan(mean_across_pop)

Comment: why

Reply: Oops, apparently programming in Matlab left a mark on me. I guess the two are equivalent, but I'll be consistent with the rest of the world and use

        return xvalues[valid], mean_across_pop[valid], sem_across_pop[valid]
    elseif ce[1] == :across
        # get mean value and sem of function of interest
        splitdata = groupby(df, ce[2])
        return get_summary(trend, variation, f, splitdata, xvalues, args...; kwargs...)
    elseif ce[1] == :bootstrap
        n_samples = ce[2]
        indexes = Array(Int64, 0)
        split_var = Array(Int64, 0)
        for i = 1:n_samples
            append!(indexes, rand(1:size(df,1), size(df,1)))
            append!(split_var, fill(i, size(df,1)))
        end

Comment: This way of dealing with

Reply: Actually, with real data these plots tend to have huge error bars towards the end, which is somewhat annoying, but it could also be that one has a lousy subject with a weird range, and you don't want that to prevent you from looking at most of your data. As of now I require at least 2 subjects to be non-NaN to plot the mean and sem, but that could be another keyword (in terms of the number or fraction of subjects required to consider a point "plottable"). In principle one could even show, with vertical lines or something else, where the majority of the data is (say from the 0.05 to the 0.95 quantile).

Comment: No, I think it is fine to leave it as is.

        split_col = newsymbol(:split_col, df)
        bootstrap_data = df[indexes,:]
        bootstrap_data[split_col] = split_var
        ends = collect(size(df,1)*(1:n_samples))
        starts = ends - size(df,1) + 1
        splitdata = GroupedDataFrame(bootstrap_data, [split_col], collect(1:ends[end]), starts, ends)
        return get_summary(trend, variation, f, splitdata, xvalues, args...; kwargs...)
    else
        error("compute_error can only be equal to :none, :across,
        (:across, col_name), :bootstrap or (:bootstrap, n_samples)")
    end
end
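The `:bootstrap` branch builds `n_samples` resampled datasets by concatenating row indices drawn with replacement, plus a split column tagging each replicate. The index bookkeeping in isolation (small `n` chosen for illustration):

```julia
# Nonparametric bootstrap indices: each replicate draws n rows
# with replacement; split_var records which replicate each row belongs to.
n, n_samples = 5, 3
indexes = Int[]
split_var = Int[]
for i in 1:n_samples
    append!(indexes, rand(1:n, n))   # rows of replicate i, sampled with replacement
    append!(split_var, fill(i, n))   # tag those rows with replicate number i
end

length(indexes)        # → 15 (n * n_samples)
all(in(1:n), indexes)  # → true: every index is a valid row
```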
"""
`groupapply(f::Function, df, args...; axis_type = :auto, compute_error = :none, group = [], summarize = (mean, sem), kwargs...)`

Split `df` by `group`, then apply `get_summary` to get a population summary of the grouped data.
The output is a `GroupedError` with the error computed according to the keyword `compute_error`.
It can be plotted using `plot(g::GroupedError)`; the series type can be specified to be
`:path`, `:scatter` or `:bar`.
"""

Comment: There are a few typos here. Also, the

function groupapply(f::Function, df, args...;

Comment: docstring please

                    axis_type = :auto, compute_error = :none, group = [],
                    summarize = (errortype(compute_error) == :bootstrap) ? (mean, std) : (mean, sem), kwargs...)
    if !(axis_type in [:discrete, :continuous])
        axis_type = (typeof(df[args[1]]) <: PooledDataArray) ? :discrete : :continuous
    end
    if (axis_type == :continuous) & !(eltype(df[args[1]]) <: Real)
        warn("Changing to discrete axis, x values are not real numbers!")
        axis_type = :discrete
    end
    mutated_xtype = (axis_type == :continuous) ? Float64 : eltype(df[args[1]])

    # Add defaults for :across and :bootstrap
    if compute_error == :across
        row_name = newsymbol(:rows, df)
        df[row_name] = 1:size(df,1)
        ce = (:across, row_name)
    elseif compute_error == :bootstrap
        ce = (:bootstrap, 1000)
    else
        ce = compute_error
    end

    g = GroupedError(
        Array(Vector{mutated_xtype}, 0),
        Array(Vector{Float64}, 0),
        Array(Vector{Float64}, 0),
        Array(AbstractString, 0),
        axis_type,
        ce != :none
        )
    if group == []
        xvalues, yvalues, shade = get_summary(summarize..., f, df, ce, axis_type, args...; kwargs...)
        push!(g.x, xvalues)
        push!(g.y, yvalues)
        push!(g.err, shade)
        push!(g.group, "")
    else
        #group_array = isa(group, AbstractArray) ? group : [group]
        by(df, group) do dd
            label = isa(group, AbstractArray) ?
                string(["$(dd[1,column]) " for column in group]...) : string(dd[1,group])
            xvalues, yvalues, shade = get_summary(summarize..., f, dd, ce, axis_type, args...; kwargs...)
            push!(g.x, xvalues)
            push!(g.y, yvalues)
            push!(g.err, shade)
            push!(g.group, label)
            return
        end
    end
    if compute_error == :across; delete!(df, row_name); end

    return g
end
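Putting it together, the intended workflow is the one quoted from the README in the review thread above. This sketch assumes a `school` DataFrame with columns `:MAch`, `:School` and `:Sx` is already loaded (it is not constructed here):

```julia
using StatPlots

# Cumulative distribution of :MAch, with error bars computed across
# the :School column, one series per value of :Sx.
grp_error = groupapply(:cumulative, school, :MAch; compute_error = (:across, :School), group = :Sx)
plot(grp_error, line = :path)
```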
builtin_funcs = Dict(zip([:locreg, :density, :cumulative], [_locreg, _density, _cumulative]))

"""
`groupapply(s::Symbol, df, args...; kwargs...)`

`s` can be `:locreg`, `:density` or `:cumulative`, in which case the corresponding built-in
analysis function is used. `s` can also be the symbol of a column of `df`, in which case the call
is equivalent to `groupapply(:locreg, df, args[1], s; kwargs...)`.
"""
function groupapply(s::Symbol, df, args...; kwargs...)
    if s in keys(builtin_funcs)
        analysisfunction = builtin_funcs[s]
        return groupapply(analysisfunction, df, args...; kwargs...)
    else
        return groupapply(_locreg, df, args[1], s; kwargs...)
    end
end
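The symbol dispatch above can be seen in isolation: a symbol naming one of the built-ins selects that analysis, while any other symbol is treated as a y column for local regression. The dictionary of strings here is a stand-in for illustration, not the package's `builtin_funcs`:

```julia
# Stand-in for the builtin_funcs lookup: known analysis names dispatch to
# their function; unknown symbols are interpreted as a column for :locreg.
builtins = Dict(:locreg => "local regression", :density => "kernel density", :cumulative => "ecdf")

choose(s::Symbol) = haskey(builtins, s) ? builtins[s] : "locreg against column $s"

choose(:density)  # → "kernel density"
choose(:MAch)     # → "locreg against column MAch"
```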
"""
`groupapply(df::AbstractDataFrame, x, y; kwargs...)`

Equivalent to `groupapply(:locreg, df::AbstractDataFrame, x, y; kwargs...)`
"""
groupapply(df::AbstractDataFrame, x, y; kwargs...) = groupapply(_locreg, df, x, y; kwargs...)
@recipe function f(g::GroupedError)
    if !(:seriestype in keys(d)) || (d[:seriestype] == :path)
        for i = 1:length(g.group)
            @series begin
                seriestype := :shadederror
                x := cycle(g.x, i)
                y := cycle(g.y, i)
                shade := cycle(g.err, i)
                label --> cycle(g.group, i)
                ()
            end
        end
    elseif d[:seriestype] == :scatter
        for i = 1:length(g.group)
            @series begin
                seriestype := :scatter
                x := cycle(g.x, i)
                y := cycle(g.y, i)
                if g.show_error
                    err := cycle(g.err, i)
                end
                label --> cycle(g.group, i)
                ()
            end
        end
    elseif d[:seriestype] == :bar
        if g.axis_type == :continuous
            warn("Bar plot with continuous x axis doesn't make sense!")
        end
        xaxis = sort!(union((g.x)...))
        ys = extend_axis.(g.x, g.y, [xaxis], [NaN])
        y = hcat(ys...)
        if g.show_error
            errs = extend_axis.(g.x, g.err, [xaxis], [NaN])
            err := hcat(errs...)
        end
        label --> hcat(g.group...)
        StatPlots.GroupedBar((xaxis, y))
    end
end

@shorthands GroupedError
Comment: Adding another package here is a bit of a debate - I would say that StatPlots should not be afraid of including statistical packages (I guess that is one reason to keep it separate from Plots), and I also think it should depend on GLM - what is your philosophy here, @tbreloff?

Reply: Sorry I missed this comment. Yes, adding dependencies should not be done lightly, but this is probably acceptable.