Scatterplot matrix #286

Open
johansigfrids opened this issue May 4, 2014 · 12 comments

@johansigfrids

Would it be possible for Gadfly to support creating scatterplot matrices? Like http://home.centurytel.net/~mjm/NScatterplotMatrix.gif

@dcjones
Collaborator

dcjones commented May 5, 2014

This is something for which there should be a nice shortcut (like there is for plotting functions, etc.). Here's a goofy way you can fake it in the meantime:

using Gadfly, Compose, DataFrames
df = DataFrame(x=rand(100), y=rand(100), z=rand(100))
n = size(df, 2)
cs = Array(Canvas, (n, n))
for i in 1:n
    for j in 1:n
        cs[i, j] = render(plot(x=df[i], y=df[j]))
    end
end
draw(PNG("plotmatrix.png", 8inch, 8inch), gridstack(cs))

[Image: plotmatrix]

@johansigfrids
Author

How would I render it in an IJulia notebook?

@dcjones
Collaborator

dcjones commented May 6, 2014

Just removing the filename from a call to PNG or SVG should do it. E.g.

draw(PNG(8inch, 8inch), gridstack(cs))

@johansigfrids
Author

Nice. It even works with D3 for some zooming and panning action.

@dcjones
Collaborator

dcjones commented Dec 23, 2014

It's possible to make a nicer version of this now with Geom.subplot_grid, but I think we should have some sort of shortcut: it's a common type of plot, yet not particularly easy to make in GoG-style plotting systems. I'm not exactly sure what that should look like yet.
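As a rough sketch of what a Geom.subplot_grid approach needs, the wide table first has to be reshaped into long form, with one row per (x-variable, y-variable, observation). The helper name `pairwise_long` and the column names `g1`, `g2`, `x`, `y` are assumptions made for illustration, not part of Gadfly, and the block uses only Base Julia:

```julia
# Reshape wide columns into long form for a faceted plot:
# one row per (x-variable, y-variable, observation).
# `pairwise_long` is a hypothetical helper, not a Gadfly function.
function pairwise_long(cols)
    g1 = String[]   # name of the variable on the x axis (grid column)
    g2 = String[]   # name of the variable on the y axis (grid row)
    x = Float64[]
    y = Float64[]
    for (namex, vx) in cols, (namey, vy) in cols
        append!(g1, fill(string(namex), length(vx)))
        append!(g2, fill(string(namey), length(vy)))
        append!(x, vx)
        append!(y, vy)
    end
    return (g1 = g1, g2 = g2, x = x, y = y)
end

cols = [:a => rand(5), :b => rand(5), :c => rand(5)]
long = pairwise_long(cols)
```

Wrapping `long` in a DataFrame and plotting it with `xgroup=:g1, ygroup=:g2, x=:x, y=:y` and `Geom.subplot_grid(Geom.point)` should then give one panel per pair of variables, though that last step is not exercised here.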

@johansigfrids
Author

I've been making use of a hand-rolled function, but it is not very elegant or flexible. Lots of hard-coded values and such.

using Gadfly, Compose, DataFrames
x=randn(100)
df = DataFrame(firstvar=x, secondvar=x.+randn(100).+100, thirdvar=x.-randn(100), fourthvar=randn(100))

function scatterplot_matrix(df::AbstractDataFrame, cols::Vector{Symbol})
    n = length(cols)
    cs = Array(Context, (n, n))
    for (i, coli) in enumerate(cols), (j, colj) in enumerate(cols)
        if i==j
            cs[i, j] = compose(context(), text(0.5, 0.5, string(coli), hcenter, vcenter), linewidth(0.01mm), 
            stroke(color("black")), fontsize(5))
        else
            cs[i, j] = render(plot(x=df[coli], y=df[colj], Guide.xlabel(nothing), Guide.ylabel(nothing), 
                Guide.yticks(label=false), Guide.xticks(label=false)))
        end
    end
    draw(SVGJS((5*n)cm, (5*n)cm), gridstack(cs))
end

scatterplot_matrix(df, [:firstvar, :secondvar, :thirdvar, :fourthvar])

@e-neu

e-neu commented Mar 18, 2016

Hi,

I just want to follow up on this question and ask whether there has been any progress on a shortcut for this type of plot. I would also like to suggest a slightly different representation. Since the upper and lower triangles carry essentially redundant information, one triangle could be replaced with the correlation coefficients between variables i and j. And since the outcome on the diagonal is quite obvious, those panels could be replaced with histograms of the variable in question. I've seen this type of visualization in¹ and have also used it myself. An example prepared with matlab2tikz is attached. Here I had two clusters. The (default) behavior of a generic scatterplot matrix would of course show only one color and a single combined histogram.

¹Hair, Joseph F. "Multivariate data analysis." (2010).
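For the correlation triangle, the numbers themselves are cheap to compute. A minimal sketch, assuming the data is available as a numeric matrix with one column per variable (the matrix `X` and the `labels` dictionary are made up for illustration):

```julia
using Statistics

# Columns of X are variables, rows are observations (toy data).
X = [1.0 2.0 1.0;
     2.0 4.0 0.0;
     3.0 6.0 1.0;
     4.0 8.0 0.0]

# Pairwise Pearson correlation between all columns.
C = cor(X)

# Text labels for the upper-triangle panels only (i < j).
n = size(X, 2)
labels = Dict((i, j) => "Corr: $(round(C[i, j], digits=4))"
              for i in 1:n for j in (i + 1):n)
```

Each entry of `labels` could then be drawn as text in panel (i, j) while the mirrored panel (j, i) keeps the scatterplot.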

[Image: screenshot from 2016-03-18 10-10-43]

@tbreloff
Contributor

See corrplot in MLPlots/Plots: https://github.com/JuliaML/MLPlots.jl

You can use Gadfly as a backend if you want.

@e-neu

e-neu commented Mar 19, 2016

Thanks for this, @tbreloff. Looks good.

Still, something like Geom.scattermatrix would be great.

@psilva07

I needed this as part of my research and decided to write a fairly reusable function to reproduce what @e-neu described, so I'm sharing it in case someone else needs something similar. It's probably not symmetrical enough to use in a publication, due to the way Gadfly handles stacked plots with and without axis information, but it is good enough for inspection.

function scatterMatrix(olddf, colorido=[], legenda=true)
    df = olddf[complete_cases(olddf),:]
    n = size(df, 2)
    nomes = names(df)
    if colorido!=[]
        n = n-1
        splice!(nomes,findfirst(x -> string(x) == colorido, nomes))
    end
    M = Array(Compose.Context, (n,n))

    for (i,indexi) in zip(nomes,1:length(nomes))
        nowcor = false
        for (j,indexj) in zip(nomes,1:length(nomes))
            gdplot = Geom.point
            xTickMarks=yTickMarks=false
            xName=yName=""
            kps=:none
            if j == nomes[1]
                yName=string(i)
                yTickMarks=true
            end
            if i == nomes[end]
                xName=string(j)
                xTickMarks=true
            end
            if nowcor  # correlation info
                index1 = complete_cases(df[:,[i,j]])
                text0 = "Corr: "*string(trunc(cor(df[index1,i],df[index1,j]),4))
                M[indexi,indexj] = compose(context(),
                (context(), text(0.5, 0.5, text0, hcenter)),
                (context(0.1w, 0.1h, 0.8w, 0.8h), rectangle(), fill("white"), stroke("black")))
            elseif i == j  # histogram
                gdplot = Geom.histogram
                if legenda
                    kps=:right
                end
                M[indexi,indexj] = render(plot(df, x=string(j), y=string(i), color=colorido, gdplot(maxbincount=20),
                Guide.xlabel(xName), Guide.ylabel(yName), Guide.xticks(label=xTickMarks), Guide.yticks(label=yTickMarks), 
                Theme(grid_line_width=1pt, panel_stroke=colorant"black", key_position=kps)))
                nowcor=true
            else  # scatterplots
                M[indexi,indexj] = render(plot(df, x=string(j), y=string(i), color=colorido, gdplot,
                Guide.xlabel(xName), Guide.ylabel(yName), Guide.xticks(label=xTickMarks), Guide.yticks(label=yTickMarks), 
                Theme(panel_stroke=colorant"black", key_position=kps)))
            end
        end
    end
    return gridstack(M)
end

The first parameter is the DataFrame in question, the second is the name of the column to be used for the color information, and the third is a boolean indicating whether or not there should be key information on the plots.

The function assumes all the values in the DataFrame are numeric for the calculation of the correlation values (with the exception of the column used for color information, which does not appear on the plots).

Here is an example of how to use it:

using RDatasets
iris = dataset("datasets", "iris")
scmIris = scatterMatrix(iris, "Species", false)

[Image: plotmatrix]

@Mattriks
Member

Here is an attempt at a (near) publication-quality scatterplot matrix for Gadfly. You can find this on the branch scatplotmat at http://github.com/Mattriks/Gadfly.jl. I'll do a PR soon.

The function scatterplotmatdata, rather than creating the plot, simply outputs 4 dataframes: a base df, a point df, a correlation text df, and a df for implementing histograms. It's then up to the user to do the plotting themselves, as shown in the code below. I think this gives the user more flexibility, e.g. you could easily change the text by using your own Geom.label layer.

Note that the order of layers seems important, and Geom.subplot_grid seems to prefer that the first and last layer contain something to plot for all grid plots (hence the use of a transparent base layer).

Plotting histograms along the diagonal is problematic (try it by uncommenting the Geom.histogram layer in the code below). The histogram wants the y-axis to start at zero, so the y-axes get rescaled. One way to implement histograms for a scatterplot matrix might be to make a new geometry that autoscales to the current plot axes.

using Gadfly, Colors, DataFrames, RDatasets
iris = dataset("datasets","iris");

Dbase, D, Dcor, Dhist = scatterplotmatdata(iris, colorid=:Species)

theme(x) = Theme(default_color=parse(Colorant, x))
p = plot(
    Dbase, xgroup=:g1, ygroup=:g2,
    Geom.subplot_grid(
        layer(x=:x,y=:y, Geom.point, theme("transparent")),
        # layer(Dhist, xgroup=:g1, ygroup=:g2, x=:x, theme("gray"), Geom.histogram(density=true)),
        layer(Dcor, xgroup=:g1, ygroup=:g2, x=:x, y=:y, label=:label, Geom.label(position=:centered)),
        layer(D, xgroup=:g1, ygroup=:g2, x=:x, y=:y, Geom.point, color=:col),
        free_y_axis=true, 
        free_x_axis=true
    ),
    Guide.title("Iris Data"),
    Guide.colorkey("Species")
)

[Image: scatplotmat]
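On the histogram rescaling problem mentioned above, one possible workaround (an assumption on my part, not something implemented on the branch) is to precompute the bin counts and map them linearly onto the variable's own range, so the bars fit the existing axes instead of forcing the y-axis to start at zero. The function `scaled_histogram` and its `nbins` keyword are hypothetical names, and the sketch uses only Base Julia:

```julia
# Count values into `nbins` equal-width bins over [lo, hi], then map the
# counts linearly onto [lo, hi] so bar tops share the variable's axis range.
# Hypothetical helper; assumes the data is not constant.
function scaled_histogram(v::AbstractVector{<:Real}; nbins::Integer = 10)
    lo, hi = extrema(v)
    hi > lo || throw(ArgumentError("data must span a non-zero range"))
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for val in v
        # Clamp so that the maximum value lands in the last bin.
        k = clamp(floor(Int, (val - lo) / width) + 1, 1, nbins)
        counts[k] += 1
    end
    edges = [lo + (k - 1) * width for k in 1:(nbins + 1)]
    # The tallest bar reaches `hi`; an empty bin stays at `lo`.
    tops = lo .+ (hi - lo) .* counts ./ maximum(counts)
    return (edges = edges, counts = counts, tops = tops)
end

h = scaled_histogram([1.0, 1.2, 2.0, 2.1, 2.2, 4.0]; nbins = 3)
```

The bars would then be drawn from `lo` up to each value in `h.tops` between consecutive `h.edges`. How best to draw them inside Geom.subplot_grid is left open here.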

@versipellis

Bit of a bump, but does Gadfly natively support scatter matrices yet?
