

In [None]:
knitr::include_graphics("https://slcladal.github.io/images/uq1.jpg")



# Introduction{-}

This tutorial introduces data visualization using R and shows how to modify different types of visualizations in the ggplot framework in R. The entire R-markdown document can be downloaded [here](https://slcladal.github.io/rscripts/introviz.Rmd). 

R offers many ways to show and summarize data, which makes it a flexible tool that gives you full control over the individual layers of a plot. Rather than showing how to produce different types of plots (e.g. scatter plots, box plots, and line graphs), this introduction briefly presents the three main frameworks for data visualization in R (base, lattice, and ggplot) and then shows how you can modify your visualizations (e.g. changing axes and tick labels, changing colors, and showing different plots in one window). How to create different types of plots is shown in [this tutorial](https://slcladal.github.io/basicgraphs.html). The two topics are kept apart because this introduction also discusses general questions about what to keep in mind when visualizing data. The practical part presents the code used to set up graphs so that they can be recreated and also discusses potential problems that you may encounter when setting up a graph. 

As there exists a multitude of different ways to visualize data, this section only highlights the different philosophies that underlie the different frameworks for data visualization in R (base, lattice, and ggplot) and how to modify visualizations to match one's individual needs. The major advantage of using R consists in the fact that the code can be stored, distributed, and run very easily. This means that R represents a flexible framework for creating graphs that enables sustainable, reproducible, and transparent procedures. 

## Basics of data visualization{-}

Before turning to the practical issues relating to creating graphs, a few words on what one has to keep in mind when visualizing data are in order. On a very general level, graphs should be used to inform the reader about properties and relationships between variables. This implies that...

* graphs, including axes, must be labeled properly to allow the reader to understand the visualization with ease. 

* there should not be more dimensions in the visualization than there are in the data.

* all elements within a graph should be unambiguous.

* variable scales should be portrayed accurately (for instance, lines - which imply continuity - should not be used for categorically scaled variables).

* graphs should be as intuitive as possible and should not mislead the reader.
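
The point about variable scales can be illustrated with a minimal sketch. The data frame `toy` and its values are hypothetical and only serve as an illustration; they are not taken from the data used in this tutorial.

```r
library(ggplot2)

# Hypothetical toy data: mean frequencies for three unordered genres
toy <- data.frame(
  Genre = c("Fiction", "Legal", "Religious"),
  Frequency = c(120, 145, 132)
)

# Bars treat the genres as separate categories and the axes are labeled
p_bar <- ggplot(toy, aes(x = Genre, y = Frequency)) +
  geom_col() +
  labs(x = "Genre", y = "Mean relative frequency of prepositions")

# A line would wrongly suggest a continuous transition between genres
p_line <- ggplot(toy, aes(x = Genre, y = Frequency, group = 1)) +
  geom_line()

p_bar
```

The bar version is usually the safer choice here because the order of the genres on the x-axis is arbitrary.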

## Different philosophies: base R, lattice, and ggplot{-}

A few words on different frameworks for creating graphics in R are in order. There are three main frameworks in which to create graphics in R: the *base* framework, the *lattice* framework, and the *ggplot* or *tidyverse* framework. 

### The base R framework{-}

The *base R* framework is the oldest of the three and is part of what is called `base R`: a collection of about 30 packages that ship with every R installation, several of which (including the *graphics* package) are loaded automatically when you start `R`. The idea behind the *base* framework is that creating a graphic is analogous to a painter painting on an empty canvas. Each line or element is added to the graph consecutively, which often leads to code that is very comprehensible but also very long.
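
As a rough sketch of this canvas idea, the following code builds a plot step by step in base R. It uses the built-in `cars` data set rather than the data of this tutorial, and the labels are chosen only for illustration.

```r
# Step 1: open an empty canvas with axes only (type = "n" draws no points)
plot(cars$speed, cars$dist, type = "n",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Base R: building a plot step by step")

# Step 2: add the observations as points
points(cars$speed, cars$dist, pch = 16, col = "gray30")

# Step 3: add a linear regression line
fit <- lm(dist ~ speed, data = cars)
abline(fit, col = "red", lwd = 2)

# Step 4: add a legend
legend("topleft", legend = c("Observations", "Linear fit"),
       col = c("gray30", "red"), pch = c(16, NA), lty = c(NA, 1), bty = "n")
```

Every element is a separate function call that draws onto the existing plot, which is why base R code tends to be explicit but long.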

### The lattice framework{-}

The *lattice* framework was a follow-up to the *base* framework and complements it insofar as it makes it much easier to display several variables and variable levels simultaneously. The philosophy of the *lattice* package is quite different from that of *base*: whereas everything has to be specified in *base*, graphs in *lattice* require very little code. They are therefore easy to create when one is satisfied with the default design but very labor-intensive to customize. However, *lattice* is very handy when summarizing relationships between multiple variables and variable levels.
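
A minimal sketch of the lattice approach, assuming the built-in `mtcars` data set as a stand-in for real data: a single call produces one panel per level of a conditioning variable.

```r
library(lattice)

# One scatter plot panel per number of cylinders, created with a single call
lat <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
              layout = c(3, 1),
              xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Lattice plots are objects that are drawn when printed
print(lat)
```

The formula `mpg ~ wt | factor(cyl)` reads as "plot `mpg` against `wt`, separately for each level of `cyl`".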

### The ggplot framework{-}

The *ggplot* environment was written by Hadley Wickham and it combines the positive aspects of both the *base* and the *lattice* package. It was first publicized in the *gplot* and *ggplot1* packages but the latter was soon repackaged and improved in the now most widely used package for data visualization: the *ggplot2* package. The *ggplot* environment implements a philosophy of graphic design described in *The Grammar of Graphics* by Leland Wilkinson [@wilkinson2012grammar].

The philosophy of *ggplot2* is to consider graphics as consisting of basic elements (the data set to be plotted and the *aesthetics*, i.e. the mapping of variables onto visual properties such as the axes, colors, or shapes) and layers that are overlaid onto these basic elements. The idea of the *ggplot2* package can be summarized as taking "care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics."

Thus, ggplots typically start with the function call (`ggplot`) followed by the specification of the data, then the aesthetics (`aes`), and then a specification of the type of plot that is created (`geom_line` for line graphs, `geom_boxplot` for box plots, `geom_bar` for bar graphs, `geom_text` for text, etc.). In addition, ggplot makes it possible to specify all elements that the graph consists of (e.g. the theme and axes). 
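
The typical structure of a ggplot call can be sketched as follows. The example assumes the built-in `mtcars` data set as a placeholder; the practical part of this tutorial uses a different data set.

```r
library(ggplot2)

gg <- ggplot(data = mtcars,                      # 1. the data
             mapping = aes(x = wt, y = mpg)) +   # 2. the aesthetics
  geom_point() +                                 # 3. the geom-layer (type of plot)
  labs(x = "Weight (1000 lbs)",                  # 4. further elements: labels ...
       y = "Miles per gallon") +
  theme_bw()                                     #    ... and the theme

gg
```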

As the ggplot framework has become the dominant way to create visualizations in R, we will only focus on this framework in the following practical examples.

## Preparation and session set up{-}

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to R and more information on how to use it [here](https://slcladal.github.io/intror.html). For this tutorial, we need to install certain *packages* so that the scripts shown below run without errors. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code. Installing all of the packages may take between 1 and 5 minutes, so you do not need to worry if it takes some time.


In [None]:
# install libraries
install.packages(c("tidyverse", "gridExtra", "knitr", "kableExtra"))


Once you have installed R and the packages by executing the code shown above, you are good to go.

# Getting started{-}

Before turning to the graphs, we will load the packages for this tutorial. The data set is called `lmmdata` but we will change the name to `plotdata` for this tutorial. The data set is based on the [*Penn Parsed Corpora of Historical English*](https://www.ling.upenn.edu/hist-corpora/) (PPC) and it contains the date when a text was written (`Date`), the genre of the text (`Genre`), the name of the text (`Text`), the relative frequency of prepositions in the text (`Prepositions`), and the region in which the text was written (`Region`). We also add two more variables to the data called `GenreRedux` and `DateRedux`. `GenreRedux` collapses the existing genres into five main categories (*Conversational*, *Religious*, *Legal*, *Fiction*, and *NonFiction*) while `DateRedux` collapses the dates when the texts were composed into five main periods (1150-1499, 1500-1599, 1600-1699, 1700-1799, and 1800-1913). We also factorize non-numeric variables. 


In [None]:
# activate packages
library(tidyverse)
library(gridExtra)
library(knitr) 
library(kableExtra)
# load data
plotdata <- read.delim("https://slcladal.github.io/data/lmmdata.txt", header = TRUE) %>%
  mutate(GenreRedux = case_when(str_detect(.$Genre, "Letter") ~ "Conversational",
                                Genre == "Diary" ~ "Conversational",
                                Genre == "Bible"|Genre == "Sermon" ~ "Religious",
                                Genre == "Law"|Genre == "TrialProceeding" ~ "Legal",
                                Genre == "Fiction" ~ "Fiction",
                                TRUE ~ "NonFiction")) %>%
  mutate(DateRedux = case_when(Date < 1500 ~ "1150-1499",
                               Date < 1600 ~ "1500-1599",
                               Date < 1700 ~ "1600-1699",
                               Date < 1800 ~ "1700-1799",
                               TRUE ~ "1800-1913")) %>%
  mutate(Genre = factor(Genre),
         Text = factor(Text),
         Region = factor(Region),
         GenreRedux = factor(GenreRedux),
         DateRedux = factor(DateRedux))


The first six rows of the data look like this:



In [None]:
kable(head(plotdata), caption = "First 6 rows of the plotdata") %>%
  kable_styling(bootstrap_options = "striped", full_width = T, position = "left")


We will now turn to creating the graphs.

# Creating a simple graph{-}

When creating a visualization with ggplot, we first call the function `ggplot` and define the data that the visualization will use. Then we define the aesthetics, which map variables onto the x- and y-axes. On its own, this produces an empty plot with axes but no data, because no geom-layer has been added yet. 


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions))



In a next step, we add the geom-layer which defines the type of visualization that we want to display. In this case, we use `geom_point` as we want to show points that stand for the frequencies of prepositions in each text. Note that we add the geom-layer by adding a `+` at the end of the line!



In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point()


We can also add another layer, e.g. a layer which shows a smoothed loess line, and we can change the theme by specifying the theme we want to use. Here, we will use `theme_bw` which stands for the black-and-white theme (we will get into the different types of themes later).



In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point() +
  geom_smooth(se = F) +
  theme_bw()


We can also store our plot in an object and then add different layers to it or modify the plot. Here we store the basic graph in an object that we call `p` and then change the axes names.  



In [None]:
p <- ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point()
p + labs(x = "Year", y = "Frequency")
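
Adding a layer to a stored plot returns a new plot and leaves the stored object unchanged, so one base plot can serve as the starting point for several variants. The sketch below illustrates this with the built-in `mtcars` data set; the object names are chosen only for this example.

```r
library(ggplot2)

# Store a basic scatter plot
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Two variants derived from the same stored plot
labelled <- base_plot + labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
smoothed <- base_plot + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# base_plot itself still contains only the point layer
labelled
smoothed
```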


We can also integrate plots into data processing pipelines as shown below. When you integrate visualizations into pipelines, you should not specify the data as it is clear from the pipe which data the plot is using.



In [None]:
plotdata %>%
  dplyr::select(DateRedux, GenreRedux, Prepositions) %>%
  dplyr::group_by(DateRedux, GenreRedux) %>%
  dplyr::summarise(Frequency = mean(Prepositions)) %>%
    ggplot(aes(x = DateRedux, y = Frequency, group = GenreRedux, color = GenreRedux)) +
    geom_line()
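
When summarizing grouped data before plotting, `summarise` keeps part of the grouping by default and prints a message about it. A possible way to make the result explicit is the `.groups` argument of `summarise`, sketched here with the built-in `mtcars` data set as an assumed stand-in.

```r
library(dplyr)
library(ggplot2)

# Summarize first and drop the grouping explicitly
summary_tbl <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(MeanMpg = mean(mpg), .groups = "drop")

# The summarized table is piped into ggplot, so no data argument is given
summary_plot <- summary_tbl %>%
  ggplot(aes(x = factor(gear), y = MeanMpg,
             group = factor(cyl), color = factor(cyl))) +
  geom_line() +
  geom_point() +
  labs(x = "Gears", y = "Mean miles per gallon", color = "Cylinders")

summary_plot
```

Storing the summarized table in an object first also makes it easy to check the numbers before they are plotted.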


# Modifying axes and titles{-}

There are different ways to modify axes. The easiest is to specify the axis labels using `labs` (as already shown above). To add a custom title, we can use `ggtitle`.


In [None]:
p + labs(x = "Year", y = "Frequency") +
  ggtitle("Preposition use over time", subtitle="based on the PPC corpus")


To change the range of the axes, we can specify their limits in the `coord_cartesian` layer.



In [None]:
p + coord_cartesian(xlim = c(1000, 2000), ylim = c(-100, 300))
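
Limits can also be set on the scales themselves, but the two approaches differ: according to the ggplot2 documentation, `coord_cartesian` only zooms into the plot, whereas limits set on a scale turn values outside the range into missing values before any statistics are computed. The following sketch contrasts the two, assuming the built-in `mtcars` data set.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Zooming: all observations still contribute to the regression line
zoomed <- base_plot + coord_cartesian(xlim = c(2, 4))

# Scale limits: observations outside 2-4 are dropped (with a warning)
# before the regression line is fitted
clipped <- base_plot + scale_x_continuous(limits = c(2, 4))

zoomed
clipped
```

If the aim is only to change the visible range, `coord_cartesian` is usually the safer option because it does not alter the data underlying smoothers or summaries.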



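To customize the appearance of the tick labels (font face, color, size, and angle), we can set `axis.text.x` and `axis.text.y` in `theme` using `element_text`.


In [None]: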
p <- ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point() + 
  labs(x = "Year", y = "Frequency")
p + theme(axis.text.x = element_text(face="italic", color="red", size=8, angle=45),
          axis.text.y = element_text(face="bold", color="blue", size=15, angle=90))


To remove the tick labels and tick marks altogether, we can set these elements to `element_blank`.


In [None]:
p + theme(
  axis.text.x = element_blank(),
  axis.text.y = element_blank(),
  axis.ticks = element_blank())


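To rename the axes and define where the tick marks are placed, we can use `scale_x_continuous` and `scale_y_continuous`, since both `Date` and `Prepositions` are numeric.


In [None]:
# Date and Prepositions are continuous, so continuous scales are used;
# breaks defines the positions of the tick marks
p + scale_x_continuous(name = "Year of composition", breaks = seq(1150, 1900, 50)) +
  scale_y_continuous(name = "Relative Frequency", breaks = seq(70, 190, 20))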


# Modifying colors{-}

To modify colors, you can include a color specification in the main aesthetics.


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point() 


Or you can specify the color in the aesthetics of the geom-layer.



In [None]:
p + geom_point(aes(color = GenreRedux))



To change the default colors manually, you can use `scale_color_manual`, define the colors you want to use in the `values` argument, and specify the variable levels that you want to distinguish by color in the `breaks` argument. You can find an overview of the colors that you can define in R [here](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf).



In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point()  + 
  scale_color_manual(values = c("red", "gray30", "blue", "orange", "gray80"),
                       breaks = c("Conversational", "Fiction", "Legal", "NonFiction", "Religious"))
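
An alternative that does not depend on the order of the levels is to pass a named vector to `values`; ggplot2 then matches colors to levels by name. The sketch below uses a small hypothetical data frame (`toy`) with made-up values in place of `plotdata`.

```r
library(ggplot2)

# Named vector: each level is explicitly paired with its color
genre_colors <- c(Conversational = "red",
                  Fiction        = "gray30",
                  Legal          = "blue",
                  NonFiction     = "orange",
                  Religious      = "gray80")

# Hypothetical toy data with the same genre levels (values are made up)
toy <- data.frame(
  Date = c(1200, 1400, 1600, 1750, 1900),
  Prepositions = c(110, 125, 140, 132, 128),
  GenreRedux = names(genre_colors)
)

color_plot <- ggplot(toy, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point(size = 3) +
  scale_color_manual(values = genre_colors)

color_plot
```

With a named vector, adding or reordering levels does not silently shift the colors to other categories.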


When the variable that you want to colorize does not have discrete levels, you use `scale_color_continuous` instead of `scale_color_manual`.



In [None]:
p + geom_point(aes(color = Prepositions)) + 
  scale_color_continuous()


You can also change colors by specifying color `palettes`. Color `palettes` are predefined vectors of colors and there are many different color `palettes` available. Below are some examples using the `Brewer` color palette.



In [None]:
p + geom_point(aes(color = GenreRedux)) + 
  scale_color_brewer()


In [None]:
p + geom_point(aes(color = GenreRedux)) + 
  scale_color_brewer(palette = 2)


In [None]:
p + geom_point(aes(color = GenreRedux)) + 
  scale_color_brewer(palette = 3)


We now use the `viridis` color palette to show how you can use another palette. The example below uses the viridis palette for a discrete variable (GenreRedux).



In [None]:
p + geom_point(aes(color = GenreRedux)) + 
  scale_color_viridis_d()


To use the viridis palette for continuous variables you need to use `scale_color_viridis_c` instead of `scale_color_viridis_d`.



In [None]:
p + geom_point(aes(color = Prepositions)) + 
  scale_color_viridis_c()


The `Brewer` color palettes (see below) are among the most commonly used color palettes but there are many more. You can find an overview of the color palettes that are available [here](https://www.datanovia.com/en/blog/top-r-color-palettes-to-know-for-great-data-visualization/).



In [None]:
library(RColorBrewer)
display.brewer.all()
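
The palettes shown in the overview can be selected by name, or their colors can be extracted as hex codes with `brewer.pal` and reused elsewhere. The following is a minimal sketch that assumes the built-in `mtcars` data set and the `Dark2` palette as examples.

```r
library(RColorBrewer)
library(ggplot2)

# Extract three colors from the "Dark2" palette as hex codes
dark2 <- brewer.pal(n = 3, name = "Dark2")
dark2

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(gear))) +
  geom_point(size = 3) +
  labs(color = "Gears")

# Option 1: select the palette by name
by_name <- base_plot + scale_color_brewer(palette = "Dark2")

# Option 2: pass the extracted hex codes to a manual scale
by_hex <- base_plot + scale_color_manual(values = dark2)

by_name
by_hex
```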


# Modifying lines & symbols{-}



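To distinguish groups by symbol rather than by color, you can map a variable to `shape` in the aesthetics. The symbols can also be set manually with `scale_shape_manual`.


In [None]: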
ggplot(plotdata, aes(x = Date, y = Prepositions, shape = GenreRedux)) +
  geom_point() 


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions)) + 
  geom_point(aes(shape = GenreRedux)) + 
  scale_shape_manual(values = 1:5)


Similarly, if you want to change the lines in a line plot, you define the `linetype` in the aesthetics.



In [None]:
plotdata %>%
  dplyr::select(GenreRedux, DateRedux, Prepositions) %>%
  dplyr::group_by(GenreRedux, DateRedux) %>%
  dplyr::summarize(Frequency = mean(Prepositions)) %>%
  ggplot(aes(x = DateRedux, y = Frequency, group = GenreRedux, linetype = GenreRedux)) +
  geom_line()


You can of course also manually specify the line types.



In [None]:
plotdata %>%
  dplyr::select(GenreRedux, DateRedux, Prepositions) %>%
  dplyr::group_by(GenreRedux, DateRedux) %>%
  dplyr::summarize(Frequency = mean(Prepositions)) %>%
  ggplot(aes(x = DateRedux, y = Frequency, group = GenreRedux, linetype = GenreRedux)) +
  geom_line() +
  scale_linetype_manual(values = c("twodash", "longdash", "solid", "dotted", "dashed"))


Here is an overview of the most commonly used linetypes in R.



In [None]:
d=data.frame(lt=c("blank", "solid", "dashed", "dotted", "dotdash", "longdash", "twodash", "1F", "F1", "4C88C488", "12345678"))
ggplot() +
scale_x_continuous(name="", limits=c(0,1)) +
scale_y_discrete(name="linetype") +
scale_linetype_identity() +
geom_segment(data=d, mapping=aes(x=0, xend=1, y=lt, yend=lt, linetype=lt))


To make your layers transparent, you need to specify `alpha` values.



In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions)) + 
  geom_point(alpha = .2)


Transparency can be particularly useful when using different layers that add different types of visualizations.



In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions)) + 
  geom_point(alpha = .1) + 
  geom_smooth(se = F)


Transparency can also be linked to other variables.



In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, alpha = Region)) + 
  geom_point()


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, alpha = Prepositions)) + 
  geom_point()


# Adapting sizes{-}



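The size of symbols can be linked to variables in the same way by specifying `size` in the aesthetics.


In [None]: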
ggplot(plotdata, aes(x = Date, y = Prepositions, size = Region, color = GenreRedux)) +
  geom_point() 


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux, size = Prepositions)) +
  geom_point() 


# Adding text{-}



In [None]:
plotdata %>%
  dplyr::filter(Genre == "Fiction") %>%
  ggplot(aes(x = Date, y = Prepositions, label = Prepositions, color = Region)) +
  geom_text(size = 3) +
  theme_bw()


In [None]:
plotdata %>%
  dplyr::filter(Genre == "Fiction") %>%
  ggplot(aes(x = Date, y = Prepositions, label = Prepositions)) +
  geom_text(size = 3, hjust=1.2) +
  geom_point() +
  theme_bw()


In [None]:
plotdata %>%
  dplyr::filter(Genre == "Fiction") %>%
  ggplot(aes(x = Date, y = Prepositions, label = Prepositions)) +
  geom_text(size = 3, nudge_x = -15, check_overlap = T) +
  geom_point() +
  theme_bw()


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  geom_point() +
  ggplot2::annotate(geom = "text", label = "Some text", x = 1200, y = 175, color = "orange") +
  ggplot2::annotate(geom = "text", label = "More text", x = 1850, y = 75, color = "lightblue", size = 8) +
    theme_bw()


In [None]:
plotdata %>%
  dplyr::group_by(GenreRedux) %>%
  dplyr::summarise(Frequency = round(mean(Prepositions), 1)) %>%
  ggplot(aes(x = GenreRedux, y = Frequency, label = Frequency)) +
  geom_bar(stat="identity") +
  geom_text(vjust=-1.6, color = "black") +
  coord_cartesian(ylim = c(0, 180)) +
  theme_bw()


In [None]:
plotdata %>%
  dplyr::group_by(Region, GenreRedux) %>%
  dplyr::summarise(Frequency = round(mean(Prepositions), 1)) %>%
  ggplot(aes(x = GenreRedux, y = Frequency, group = Region, fill = Region, label = Frequency)) +
  geom_bar(stat="identity", position = "dodge") +
  geom_text(vjust=1.6, position = position_dodge(0.9)) + 
  theme_bw()


In [None]:
plotdata %>%
  dplyr::filter(Genre == "Fiction") %>%
  ggplot(aes(x = Date, y = Prepositions, label = Prepositions)) +
  geom_label(size = 3, vjust=1.2) +
  geom_point() +
  theme_bw()


# Combining multiple plots{-}



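To show subsets of the data in separate panels of the same plot, you can use `facet_grid` or `facet_wrap`.


In [None]: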
ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  facet_grid(~GenreRedux) +
  geom_point() + 
  theme_bw()


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions)) +
  facet_wrap(vars(Region, GenreRedux), ncol = 5) +
  geom_point() + 
  theme_bw()


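To combine separate plots in one window, you can store them in objects and arrange them with `grid.arrange` from the *gridExtra* package.


In [None]: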
p1 <- ggplot(plotdata, aes(x = Date, y = Prepositions)) + geom_point() + theme_bw()
p2 <- ggplot(plotdata, aes(x = GenreRedux, y = Prepositions)) + geom_boxplot() + theme_bw()
p3 <- ggplot(plotdata, aes(x = DateRedux, group = GenreRedux)) + geom_bar() + theme_bw()
p4 <- ggplot(plotdata, aes(x = Date, y = Prepositions)) + geom_point() + geom_smooth(se = F) + theme_bw()
grid.arrange(p1, p2, nrow = 1)


In [None]:
grid.arrange(grobs = list(p4, p2, p3), 
             widths = c(2, 1), 
             layout_matrix = rbind(c(1, 1), c(2, 3)))


# Available themes{-}



In [None]:
p <- ggplot(plotdata, aes(x = Date, y = Prepositions)) + geom_point() + labs(x = "", y= "") +
  ggtitle("Default") + theme(axis.text.x = element_text(size=6, angle=90))
p1 <- p + theme_bw() + ggtitle("theme_bw") + theme(axis.text.x = element_text(size=6, angle=90))
p2 <- p + theme_classic() + ggtitle("theme_classic") + theme(axis.text.x = element_text(size=6, angle=90))
p3 <- p + theme_minimal() + ggtitle("theme_minimal") + theme(axis.text.x = element_text(size=6, angle=90))
p4 <- p + theme_light() + ggtitle("theme_light") + theme(axis.text.x = element_text(size=6, angle=90))
p5 <- p + theme_dark() + ggtitle("theme_dark") + theme(axis.text.x = element_text(size=6, angle=90))
p6 <- p + theme_void() + ggtitle("theme_void") + theme(axis.text.x = element_text(size=6, angle=90))
p7 <- p + theme_gray() + ggtitle("theme_gray") + theme(axis.text.x = element_text(size=6, angle=90))
grid.arrange(p, p1, p2, p3, p4, p5, p6, p7, ncol = 4)


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point() + 
  theme(panel.background = element_rect(fill = "white", colour = "red"))


Extensive information about how to modify themes can be found  [here](https://ggplot2.tidyverse.org/reference/theme.html).
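
Theme modifications that are needed repeatedly can be stored in an object and added to any plot. The sketch below shows one way to do this; the object name `my_theme`, the chosen settings, and the use of the built-in `mtcars` data set are assumptions made for illustration.

```r
library(ggplot2)

# A complete theme plus a few custom modifications, stored for reuse
my_theme <- theme_bw() +
  theme(legend.position = "top",
        axis.text.x = element_text(size = 8, angle = 45, hjust = 1),
        panel.grid.minor = element_blank())

themed_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "Cylinders") +
  my_theme

themed_plot
```

Because the modifications are added after `theme_bw`, they override the corresponding settings of the complete theme rather than being overwritten by it.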

# Modifying legends{-}


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point() + 
  theme(legend.position = "top")


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, color = GenreRedux)) +
  geom_point() + 
  theme(legend.position = "none")


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, linetype = GenreRedux, color = GenreRedux)) +
  geom_smooth(se = F) +  
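  # Numeric coordinates place the legend inside the plotting area; ggplot2 >= 3.5.0
  # prefers legend.position = "inside" with legend.position.inside = c(0.2, 0.7)
  theme(legend.position = c(0.2, 0.7)) 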


In [None]:
ggplot(plotdata, aes(x = Date, y = Prepositions, linetype = GenreRedux, color = GenreRedux)) +
  geom_smooth(se = F) + 
  guides(color=guide_legend(override.aes=list(fill=NA))) +  
  theme(legend.position = "top", 
        legend.text = element_text(color = "green")) +
  scale_linetype_manual(values=1:5, 
                        name=c("Genre"),
                        breaks = names(table(plotdata$GenreRedux)),
                        labels = names(table(plotdata$GenreRedux))) + 
  scale_colour_manual(values=c("red", "gray30", "blue", "orange", "gray80"),
                      name=c("Genre"),
                      breaks=names(table(plotdata$GenreRedux)),  
                      labels = names(table(plotdata$GenreRedux)))


# Citation & Session Info {-}

Schweinberger, Martin. 2020. *Introduction to Data Visualization in R*. Brisbane: The University of Queensland. url: https://slcladal.github.io/introviz.html  (Version `r format(Sys.time(), '%Y.%m.%d')`).


In [None]:
@manual{schweinberger2021introviz,
  author = {Schweinberger, Martin},
  title = {Introduction to Data Visualization in R},
  note = {https://slcladal.github.io/introviz.html},
  year = {2021},
  organization = {The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {`r format(Sys.time(), '%Y.%m.%d')`}
}


In [None]:
sessionInfo()



# References {-}
