title: ggplot2 tutorial
output: html_document
author: Alastair Kerr, Manager, WTCCB Bioinformatics Core Facility, University of Edinburgh
version: 0.3

Outline

The ggplot2 package (an implementation of the grammar of graphics) provides a flexible and powerful set of functions that, even with default settings, produce useful and visually appealing graphics. For a beginner, however, the wealth of options can be overwhelming. The goal of this tutorial is to teach the basic syntax of each element you need to add to create a professional-looking graph. Useful libraries that provide additional functionality are also introduced.

Learning Objectives

By the end of the tutorial, you'll be able to:

  • Create and save basic graphics
  • Understand how to create different types of graphics by applying geometries
  • Apply additional layers to the graphic
  • Visually subset the graphic by applying fills and gradients
  • Change the appearance of the graphic by using themes
  • Create sub-graphics by applying facets
  • Understand why and how to convert the underlying dataframe to a tidy (long) format
  • Use additional libraries such as plotly to enhance the utility of the graphic.

Library Install and Usage

Packages provide additional functions and can be downloaded from a network of servers called the Comprehensive R Archive Network (CRAN).

Installing packages:

#install.packages("PACKAGENAME")
#Today we will use the following packages
#Please ensure that they are installed before attending the session
install.packages(c("ggplot2", "ggthemes", "ggsci", "reshape2", "plotly","ggExtra", "svglite"))
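
If some of these packages are already present, you may prefer to install only the missing ones. The following is a minimal sketch of one way to check; the helper name missing_packages is hypothetical and not part of any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

workshop_pkgs <- c("ggplot2", "ggthemes", "ggsci", "reshape2",
                   "plotly", "ggExtra", "svglite")
to_install <- missing_packages(workshop_pkgs)
to_install

# Uncomment to install only what is missing
# if (length(to_install) > 0) install.packages(to_install)
```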

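Many biology-specific packages are distributed through Bioconductor rather than CRAN. Bioconductor packages, as well as CRAN packages, can be installed with the BiocManager package, which replaced the older biocLite() script.

install.packages("BiocManager")
BiocManager::install("PACKAGENAME")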

Other options exist such as installing from github or from local files.

#We will also be using this package briefly, please install before session starts
install.packages("devtools")
library(devtools)
devtools::install_github('baptiste/egg')

I had trouble on my Windows laptop when R was installed in the default "Program Files" directory. Removing R and reinstalling it to a directory without spaces in its path enabled this command to work.

To use the functions from these packages, load each package with the library() function.

library(ggplot2)
library(ggthemes)
library(ggsci)
library(reshape2)
library(plotly)
library(ggExtra)
library(egg)
library(svglite)

These are the packages that we will use today. We will be mostly using ggplot2 and touch on the functionality of the other packages.

Fisher's Iris dataset (pre-installed in R)

iris

Fisher examined the length and width of the petals and sepals of irises to develop a classification algorithm. Here we will explore the relationship between species and the dimensions of the flowers using the ggplot2 plotting package. Fisher's data is stored as a built-in data set in R. To see the data, type

iris

or

head(iris)

If this does not work, you may first need to load the data via the data function

data(iris)
head(iris)

Note that this is a data frame. A data frame is a collection of vectors that can be accessed using the vector name or the column number. e.g.

iris$Sepal.Length 
# or 
iris[,1]

This data frame has two types of data, continuous and discrete.

str(iris)

The continuous data is of type numeric and the discrete data is of type factor, which here takes one of the 3 species as its value

levels(iris$Species)
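
As a quick check of the points above, the sketch below pulls out the same column in three equivalent ways and then summarises the column types and the species counts. It uses only base R and the built-in iris data; the variable names are chosen for illustration.

```r
data(iris)

# Three equivalent ways to pull out the same column
by_name   <- iris$Sepal.Length
by_index  <- iris[, 1]
by_string <- iris[["Sepal.Length"]]
identical(by_name, by_index)

# Class of every column: four numeric vectors and one factor
column_classes <- sapply(iris, class)
column_classes

# Number of flowers recorded for each species
species_counts <- table(iris$Species)
species_counts
```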

ggplot2 overview

In this workshop we will be using the ggplot2 package for plotting.

library(ggplot2)

ggplot2 is a suite of functions that create a “grammar” for creating graphs. “Layers” of the graph can be added together. You do not need to specify details for all the layers as many of the default values are fine.

First let's define the dataframe to use

p <- ggplot(data=iris) 

At this stage you can also define what columns to plot within the dataframe by using the aesthetics function, aes(). If you do this, it will be default for the whole graph. You can also define the aesthetics later within specific layers.
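
To make the difference concrete, here is a small sketch that defines the same aesthetics in the two places. It uses geom_point(), which is introduced further down; the object names p_global and p_layer are only for illustration.

```r
library(ggplot2)

# Aesthetics set in ggplot() are the default for every layer
p_global <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# Aesthetics set inside the geometry apply to that layer only
p_layer <- ggplot(data = iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width))

# Both layers end up plotting the same x values
identical(layer_data(p_global)$x, layer_data(p_layer)$x)
```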

Here are the vector names within the iris dataframe

names(iris)

Let us plot "Sepal.Length" vs "Sepal.Width" and store it in an object "p"

p <- ggplot(data=iris, aes(x=Sepal.Length, y=Sepal.Width) )

Note that we still need to add layers to this plot; at this stage only the empty axes are drawn.

p

The key layer to define is the geometry of the plot. There are many geometries to choose from but for now let us just use points.

p <- p + geom_point()

So "p" now contains the data and a geometry layer. Now we can plot a graph!

p

Note that each layer you add will inherit the aesthetics defined in the ggplot() call. These aesthetics can be overridden within a layer, although for clarity you may prefer to assign them only in the geometries. Note the following:

ggplot(data=iris, aes(x=Sepal.Length, y=Sepal.Width) ) + geom_point(aes(x=Petal.Length, y=Petal.Width)) 

Although the first set of aesthetics is never drawn by a geometry, its column names are still used for the axis labels, while the plotted data come from the aesthetics given in the geometry.
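
If you do override aesthetics in a layer like this, one way to avoid misleading axis labels is to set them explicitly. A minimal sketch, assuming you want the labels to describe the petal data that is actually drawn (p_petal is an illustrative name):

```r
library(ggplot2)

# The layer plots petal data, so label the axes to match
p_petal <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(x = Petal.Length, y = Petal.Width)) +
  labs(x = "Petal length", y = "Petal width")
p_petal
```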

Let's go back to our object "p" with sepal length versus sepal width.

We can colour the points in the geometry.

p <- p + geom_point(colour="red")
p

Not very informative though, so let's start again and colour the points by species. Note also that the argument names data, x and y are not necessary if the arguments are given in the correct order (but naming them will aid the clarity of your code).

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, colour=Species) ) +  geom_point()
p

We can also use the aes_string() function in place of aes(). The only difference is that the column names need to be quoted, which allows the use of variables containing the name. Note that aes_string() is deprecated in ggplot2 3.0.0 and later, although it still works. i.e.

sw <- "Sepal.Width"
ggplot(iris, aes_string("Sepal.Length", sw, colour="Species") ) +  geom_point()
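
The ggplot2 documentation suggests the .data pronoun inside aes() as the replacement for aes_string(). The sketch below draws the same plot that way; it assumes ggplot2 3.0.0 or later, and p_pronoun is an illustrative name.

```r
library(ggplot2)

sw <- "Sepal.Width"

# .data[[...]] looks up a column by the string held in a variable
p_pronoun <- ggplot(iris, aes(x = .data[["Sepal.Length"]],
                              y = .data[[sw]],
                              colour = .data[["Species"]])) +
  geom_point()
p_pronoun
```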

Each function within the ggplot package has its own help file associated with it.

?geom_point()

Note that every geometry also has a default statistical transformation (stat). For geom_point() this is "identity", which means the values are plotted exactly as they appear in the data. Other geometries have different defaults.
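
As an illustration of a non-identity default, geom_bar() uses the "count" stat. The sketch below uses layer_data() to look at the values the stat computes before they are drawn; the object names are illustrative.

```r
library(ggplot2)

# geom_bar() defaults to stat = "count": it tallies the rows for each x value
bars <- ggplot(iris, aes(x = Species)) + geom_bar()

# Inspect the values computed by the stat
computed <- layer_data(bars)
computed$count
```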

Also the points were coloured by species using

colour=Species

If a geometry was used that had boxes, such as geom_boxplot() or geom_bar(), these boxes can be filled with colour using

fill=Species

within the aesthetics function, aes().

Here is an example using the histogram geometry. I'm setting the binwidth to a more sensible value than the default.

h<- ggplot(iris, aes(Sepal.Length/Sepal.Width, fill=Species )) +geom_histogram(binwidth=0.1)
h

Here is another example using the violin geometry. Here we are using the "group" aesthetic to turn continuous data into discrete levels.

Sepal.Length.Groups <- cut_width( iris$Sepal.Length, 0.5 )
b <- ggplot(data = iris) +  
  geom_violin( aes( x = Sepal.Length, y = Sepal.Width, 
                    group = Sepal.Length.Groups,
                    fill = Sepal.Length.Groups ) ) +
  labs( fill = "Sepal length intervals", 
        x = "Sepal length", 
        y = "Sepal width" )
b
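
To see what the grouping variable looks like, it can help to inspect the output of cut_width() on its own. A short sketch (sepal_bins is an illustrative name):

```r
library(ggplot2)

# cut_width() bins a numeric vector into intervals of the given width
# and returns a factor with one level per interval
sepal_bins <- cut_width(iris$Sepal.Length, width = 0.5)
levels(sepal_bins)

# Number of flowers that fall into each interval
table(sepal_bins)
```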

Changing the theme

Recap:

p <- ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, colour=Species) ) +  geom_point() 

We can continue to modify the plot. Let's make the axis labels a bit prettier with the xlab() and ylab() functions

p <- p + xlab('Sepal length') + ylab('Sepal width') 
p

All the characteristics of the plot, such as text size and the background are managed by the function called theme(). To see what a theme can change:

?theme

You can use a theme with predefined defaults

p + theme_bw()

or create your own from scratch

p + theme(
  panel.background = element_blank(),
  panel.grid.major = element_line(colour = "darkgrey"),
  text = element_text(size = 12),
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.grid.major.y = element_blank()
)
# Note you can save your theme and reuse it
theme_for_nature <- theme(
  panel.background = element_blank(),
  panel.grid.major = element_line(colour = "darkgrey"),
  text = element_text(size = 12),
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.grid.major.y = element_blank()
)

We can reuse the theme, for example on the histogram we made earlier.

h + theme_for_nature

Note that theme() changes the underlying graph properties and not the layers themselves.
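
A middle route between a predefined theme and one built from scratch is to start from a predefined theme and adjust a few elements. The following is a sketch of that idea; the name theme_workshop and the particular settings are only examples.

```r
library(ggplot2)

# Start from theme_bw() and override two elements
theme_workshop <- theme_bw(base_size = 12) +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank())

p_themed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  theme_workshop
p_themed
```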

There are predefined themes available; RStudio will help you choose one by offering completions after you type "theme_". Further themes are available in other packages such as ggthemes. Here is one based on https://fivethirtyeight.com/ .

library(ggthemes)
h + theme_fivethirtyeight()

Saving the plots

In RStudio there are many options to save the image. From the Plots tab, select Export and "Save as Image".

If you want to use ggplot2 in a script, or from another platform, you can export a graph using the function ggsave().

#save a png file 
ggsave("IrisDotplot.png", p)
#save a jpeg file
ggsave("IrisDotplot.jpg", p)
#save a pdf 
ggsave("IrisDotplot.pdf", p)
#save an svg (uses the svglite package)
ggsave("IrisDotplot.svg", p)

With ggsave() you can also set the resolution (dpi) as well as the width and height of the image. See the help page for ggsave for more options.
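
For example, the sketch below saves a plot at a fixed physical size and resolution. The size, the dpi value and the use of a temporary directory are arbitrary choices for illustration.

```r
library(ggplot2)

p_save <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

# Write to a temporary directory so the example does not clutter your project
out_file <- file.path(tempdir(), "IrisDotplot.png")

# 15 cm x 10 cm at 300 dots per inch
ggsave(out_file, plot = p_save, width = 15, height = 10, units = "cm", dpi = 300)
file.exists(out_file)
```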

Colouring points and bars

As we have seen, points (and bars) can be coloured by a value in your dataframe using the aes() function. Earlier we coloured by a factor, but we can also colour by a continuous value, which will create a gradient of colour. Let's look at the petals this time and colour by the ratio of Sepal.Length to Sepal.Width.

q <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Sepal.Length/Sepal.Width )) +  geom_point(alpha=0.5) 
q

To change the colours we need to use another suite of functions which are not unique to ggplot2. There are several options and we will go through a few examples.

Let's first look at the palettes from colorbrewer2.org. Details can be found here.

For these palettes, continuous scales use the scale_color_distiller() function whereas discrete factors use scale_colour_brewer().

q + scale_color_distiller(palette="Paired")

An alternative method is to use the scale_color_gradient2() function.

q + scale_color_gradient2(high="darkred", low="blue",  mid="red", midpoint=2, space="Lab")

Note that for bars, you need to specify the fill colour. Let's use the histogram, h, we made earlier.

h+ scale_fill_manual(values = c("magenta","cyan", "darkblue"))

The options for colouring graphs are huge. The help pages for many of the functions will refer to the general help page for that package.

?scale_color_distiller
?scale_fill_manual
?scale_color_brewer

There is plenty of help to be found on the web for colouring in R such as this in-depth "CookBook" of R code snippets for ggplot2

Moreover, ColorBrewer 2 is a great site for getting the hex values of colours to suit needs such as printer or colour-blind friendliness.
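
Hex values copied from such a site can be passed straight to scale_fill_manual(). The sketch below uses the three values of the ColorBrewer "Dark2" palette and names each one so that the species-to-colour mapping is explicit; the object names are illustrative.

```r
library(ggplot2)

# Hex values from the 3-class ColorBrewer "Dark2" palette, named by species
species_colours <- c(setosa     = "#1b9e77",
                     versicolor = "#d95f02",
                     virginica  = "#7570b3")

h_hex <- ggplot(iris, aes(Sepal.Length / Sepal.Width, fill = Species)) +
  geom_histogram(binwidth = 0.1) +
  scale_fill_manual(values = species_colours)
h_hex
```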

Like themes, there are several packages available on CRAN or other package sites that have pre-built colour palettes. ggsci is one that contains palettes based on key journals.

e.g. for Nature Publishing Group

library(ggsci)
h + scale_fill_npg()

Faceting

Let's look at plot q again

q <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Sepal.Length/Sepal.Width )) +  geom_point() 
q

We have most of the data in this plot except Species.

Multiple plots can easily be generated if we have one or two factors by which to split the plots. The splitting is called faceting. This is achieved using either the function facet_grid() or facet_wrap().

q + facet_grid(~Species)

In this case facet_wrap() would have produced an identical plot. The syntax facet_grid(factorY ~ factorX) will produce an X by Y grid with all factorY values down one side and all factorX values along the other. For instance:

Countries <- c("Italy", "Spain", "France", "UK")

IrisDataWithCountries <- iris
IrisDataWithCountries$Country <- sample(Countries, nrow(iris), replace = TRUE)

q <- ggplot(IrisDataWithCountries, aes(x = Petal.Length, 
                                       y = Petal.Width, 
                                       colour = Sepal.Length/Sepal.Width ) ) + geom_point() 
q + facet_grid(Country ~ Species)

If data is missing for pairs of factors, a blank graph will be produced. facet_wrap() does not constrain the multiple plots into a grid and instead just creates all plots that have data.
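
The sketch below illustrates this difference on a small constructed example. The country labels are assigned deterministically and one species/country combination is removed on purpose, so the data is artificial and serves only to show the layout behaviour.

```r
library(ggplot2)

# Artificial example data: alternate two country labels down the rows
iris_c <- iris
iris_c$Country <- rep(c("Italy", "Spain"), length.out = nrow(iris_c))

# Remove one combination so that it has no data
iris_c <- iris_c[!(iris_c$Species == "setosa" & iris_c$Country == "Spain"), ]

base_plot <- ggplot(iris_c, aes(x = Petal.Length, y = Petal.Width)) + geom_point()

# facet_grid() keeps the full 2 x 3 grid, leaving one panel empty
grid_plot <- base_plot + facet_grid(Country ~ Species)

# facet_wrap() only creates the five panels that have data
wrap_plot <- base_plot + facet_wrap(~ Country + Species)

grid_plot
wrap_plot
```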

We can also tell facet_wrap() to print in a certain number of columns, using our histogram, h, from before.

h  + facet_wrap(~Species, ncol=1) 

Additional functions

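The boxplot geometry expects the x value to be a factor. In older versions of ggplot2, plotting it the other way round does not work correctly (since version 3.3.0 the orientation is detected automatically and horizontal boxes are drawn).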

b <- ggplot(iris, aes(y=Species, x=Sepal.Length )) +  geom_boxplot() 
b

Whereas this is ok

b <- ggplot(iris, aes(x=Species, y=Sepal.Length )) +  geom_boxplot() 
b

To swap the axes we can use another function called coord_flip()

b <- b + coord_flip()
b

We can also reverse the direction of an axis. Note that even though we have flipped the plot, Species is still technically the x-axis.

b + scale_y_reverse()

Changing the order of the Species is best done by creating ordered factors in the iris dataset. Also we need to generate the plot again as the underlying data has changed.

iris$Species <- ordered(iris$Species, levels=c("virginica", "setosa",  "versicolor"))

b <- ggplot(iris, aes(x=Species, y=Sepal.Length )) +  geom_boxplot() + coord_flip()
b
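
Instead of typing the level order by hand, the order can also be derived from the data. The sketch below uses base R's reorder() to sort the species by decreasing median sepal length; it works on a copy (iris_ordered, an illustrative name) so the iris data itself is left untouched.

```r
library(ggplot2)

iris_ordered <- iris

# Order the species by the median of -Sepal.Length, i.e. largest median first
iris_ordered$Species <- reorder(iris_ordered$Species,
                                -iris_ordered$Sepal.Length,
                                FUN = median)
levels(iris_ordered$Species)

b_ordered <- ggplot(iris_ordered, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  coord_flip()
b_ordered
```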

More details at the cookbook snippet for ordering factors

Adding more layers

Sometimes it makes sense to add ggplot2 geometries to a graph you already have. geom_jitter() is a good example. From our boxplot we can add the points that contribute to it, jittered so that they do not overlap.

b + geom_jitter()
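
The amount of jitter can be controlled. In the sketch below the points are only spread along the Species axis, so their Sepal.Length values stay exact, and the boxplot outliers are hidden so that they are not drawn twice. The specific width and alpha values are arbitrary choices.

```r
library(ggplot2)

b_points <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(outlier.shape = NA) +                   # hide outliers: the jittered points show them
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +  # spread along the Species axis only
  coord_flip()
b_points
```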

Problems with overlays and advantages of the long format

Say we want to plot all values per species as a boxplot.

As stated earlier, defaults are generated early as we add data to the plot. Also it is difficult to interact with the layers to make the plot look nice, even if we add some opacity using an alpha value.

d <- ggplot(iris)
# alpha is a fixed value, so it is set outside aes()
b1 <- geom_boxplot(aes(x=Species, y=Sepal.Length), alpha=0.5)
b2 <- geom_boxplot(aes(x=Species, y=Sepal.Width), alpha=0.5)
b3 <- geom_boxplot(aes(x=Species, y=Petal.Width), alpha=0.5)
b4 <- geom_boxplot(aes(x=Species, y=Petal.Length), alpha=0.5)
d + b1 +b2 + b3 + b4

Note how the axis labels are set to the first layer. Also there is not much control on where the boxes are placed.

One way around this is to change the data for the plot. We can change the data to a "long" format such that each column header becomes a level of a factor that labels the value. We can use the reshape2 library to do this.

library(reshape2)
iris.long <- melt(iris)
head(iris.long)

The melt function has correctly guessed that we want to keep Species as the identifier variable. If we had multiple columns that we wanted to keep, we could use the id.vars argument, e.g.

melt(data, id.vars=c("Col1", "Col2"))

NOTE: The gather() function from the tidyr package (part of the tidyverse) can also be used; newer versions of tidyr recommend pivot_longer() instead.

Let's rename the variable column to something sensible.

#check what we want to change 
names(iris.long)
#change just column 2
names(iris.long)[2] <- "Flower.Part"
#Check the results
names(iris.long)
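
As an alternative to renaming afterwards, melt() can be given the identifier column and the new column name directly through its id.vars and variable.name arguments. A short sketch (iris_long is an illustrative name):

```r
library(reshape2)

# Name the identifier column explicitly and label the new factor column in one step
iris_long <- melt(iris, id.vars = "Species", variable.name = "Flower.Part")
names(iris_long)
head(iris_long)
```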

Now it is easy to create the boxplot we wanted.

bigbox <- ggplot(iris.long, aes(Species, value, fill=Flower.Part)) + geom_boxplot()
bigbox

Also note that the 1st value to aes() is assumed to be x, the second assumed to be y.

Interactive graphs with Plotly

"Plotly for ggplot2 is an interactive, browser-based charting library built on the open source javascript graphing library, plotly.js. It works entirely locally, through the HTML widgets framework. See up-to-date documentation and examples at https://plot.ly/ggplot2"

library(plotly)

Use the ggplotly function to get an interactive plot

ggplotly(p)

By default the mouse-over text shows what is mapped in the aesthetics. You can change this using the tooltip argument of ggplotly(), and by adding some false aesthetics you can make more data accessible to the tooltip.

p2<- p+ geom_point(aes(pl=Petal.Length, pw= Petal.Width))
#now you can use either the false aesthetic or its name
ggplotly(p2, tooltip=c("Species", "Petal.Length", "pw") )
# bit of a hack but it works...
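
Another approach described in the plotly documentation is to build the tooltip text yourself in a "text" aesthetic and then ask ggplotly() to show it. The sketch below assumes that approach; ggplot2 may warn that "text" is an unknown aesthetic, which is expected here.

```r
library(ggplot2)
library(plotly)

# Build the tooltip string in a "text" aesthetic; <br> starts a new line in the tooltip
p_text <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point(aes(text = paste0("Petal length: ", Petal.Length,
                               "<br>Petal width: ", Petal.Width)))

# Show the species (colour) together with the custom text
widget <- ggplotly(p_text, tooltip = c("colour", "text"))
widget
```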

Plot.ly graphs can be used locally, or online either through a shiny server (you need to use renderPlotly and plotlyOutput) or using plot.ly's servers to get a URL (details on their web page).

RStudio Add-ins

Recent versions of RStudio support useful add-ins. "ggExtra" provides the ggMarginal function, which adds density plots (by default) to the margins of your plot.

library(ggExtra)
ggMarginal(p)

Let's change the theme so that the legend is not in the way and then change the margin plot to a histogram.

ggMarginal(p + theme_fivethirtyeight(), type="histogram", binwidth=0.1)
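
ggMarginal() can also split the marginal plots by the colour grouping of the main plot. The sketch below assumes a ggExtra version that provides the groupColour and groupFill arguments, and moves the legend to the bottom so that it does not sit between the plot and the right-hand margin.

```r
library(ggplot2)
library(ggExtra)

p_legend_bottom <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  theme(legend.position = "bottom")

# One marginal density curve per species, coloured to match the points
m <- ggMarginal(p_legend_bottom, type = "density",
                groupColour = TRUE, groupFill = TRUE)
m
```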

Multi-plot Layouts

The egg library makes it easy to arrange multiple plots

library(egg) # devtools::install_github('baptiste/egg')
ggarrange(p,q,h)

The help page will provide more information on how to specify the number of columns or rows, e.g.

ggarrange(p,q,h, ncol=2) 

and it can be combined with ggsave

ggsave("test.png", ggarrange(p,q,h, ncol=2) )
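
When the arranged figure is only needed for saving, it can be built without drawing it on screen first. The sketch below assumes the plots, widths and draw arguments of egg::ggarrange(); the two plots and the output size are arbitrary examples.

```r
library(ggplot2)
library(egg)

scatter <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()
boxes <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

# draw = FALSE returns the arranged figure without plotting it;
# widths makes the first panel twice as wide as the second
combined <- ggarrange(plots = list(scatter, boxes), ncol = 2,
                      widths = c(2, 1), draw = FALSE)

arranged_file <- file.path(tempdir(), "IrisArranged.png")
ggsave(arranged_file, combined, width = 24, height = 10, units = "cm")
```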