---
title: Automated Data Reports with R
author: Paul Campbell
description: "Create reproducible reports from data cleaning to visualisation with rmarkdown"
date: '2018-10-22'
slug: automated-data-reports-with-r
twitterImg: img/rmarkdown_ouputs.png
categories:
- Tutorials
tags:
- R
- rmarkdown
- data reporting
- data visualisation
- ggplot2
params:
  continent: Europe
---
```{r include=FALSE}
knitr::opts_chunk$set(fig.width = 9,
                      fig.height = 5,
                      fig.align = "center",
                      cache = FALSE,
                      message = FALSE,
                      warning = FALSE,
                      echo = TRUE)
```
A lot of data analysts will find themselves doing repetitive manual tasks on a data set every day/week/month in Excel, then copying and pasting their updated pivot tables and charts into Word or PowerPoint reports for their stakeholders. If this sounds like your job description, you may want to consider switching to a programming language like R.
Writing scripts will allow you to automate the majority of these processes, from importing your data all the way through to emailing your boss the final report. They'll never know you were actually in the pub the whole time.
---
## Automation
Automation may sound like a scary word to anyone with a job they'd like to keep, but learning to automate some of your most common data tasks can be seriously beneficial to both your organisation and your own job security! Some of the benefits of an automated reporting workflow over a manual one include:
1. **It saves you time.** Most people feel like they don't have the time to fulfill all that is asked of them at work. So if you can cut out the time taken on manual data processing and focus more on analysis and insight, that can only be a good thing.
2. **It reduces errors.** When your reporting relies on manual data entry and formulas with hard-coded cell references, one typo or out-of-place number can lead to results that are way off the mark. Automating the process with a script removes this class of manual error entirely.
3. **It expands your data visualisation options.** Using open-source software like R allows you to draw on a vast array of tools and charting libraries not available in proprietary software. For example, HTML reports built with the <a href="https://rmarkdown.rstudio.com/" target="_blank">rmarkdown</a> package can include interactive charts, maps and tables that utilise the latest web technologies (more on this below).
---
## Worked Example
In this post we're going to run through what might be a typical workflow for building a reproducible data report in R - that is, one that you can simply hit 'Run' on whenever a new wave of data comes in, with all your charts and tables updating accordingly to give you the latest insights from your data at the touch of a button.
I'm going to use the classic gapminder dataset that contains population, life expectancy, and GDP per capita metrics for many countries over time. There is a gapminder R package that would conveniently give us the data in a format best suited for data analysis - <a href="https://en.wikipedia.org/wiki/Tidy_data" target="_blank">tidy data</a> - but we want to make this run-through as realistic as possible.
As such, I've taken this tidy data set, removed one of the variables that we'll have to add back ourselves later, then split it up by year across multiple sheets in a single Excel file. Sacrilege, I know, but from experience this is a very common way data with multiple time periods is stored and distributed, so it will be useful to know how to bring it back together again with R code. If you're interested in the code required to commit such a heinous crime of 'untidying' a data set into Excel sheets, you can check out the <a href="https://gist.github.com/PaulC91/a3c61cee7bcdef7bba31d675bfa726eb" target="_blank">make gapminder messy again</a> R script over on GitHub.
<img src="https://raw.githubusercontent.com/PaulC91/intro_to_r/master/static/imgs/gapminder_messy.png" alt="messy excel data" width="100%" />
To see the full file in all its glory <a href="https://view.officeapps.live.com/op/view.aspx?src=https%3A%2F%2Fwww.cultureofinsight.com%2Fdata%2Fgapminder_messy.xlsx" target="_blank">click here</a>.
As you can see from the Excel file, each tab contains data for the year denoted in the tab name. Here we have 5 year intervals between data, but in a more realistic scenario these would be daily, weekly or monthly updates, for which this process would work equally well. Similarly, if the data was separated by a different type of variable, such as survey results for different demographic target groups, the process below would allow you to combine the data and visualise the differences between the demographics.
So let's begin with the code! If you're brand new to R, don't worry if you don't know the ins and outs of the syntax. Hopefully you will see that we can get a lot done with not a lot of code!
---
## Load the Libraries
First things first: load the libraries we'll be using for our report and set a default ggplot2 chart theme.
```{r setup, cache=FALSE}
library(tidyverse)
library(readxl)
library(hrbrthemes)
library(scales)
library(countrycode)
library(glue)
library(plotly)
library(crosstalk)
library(DT)
library(gganimate)
library(widgetframe)
theme_set(theme_ft_rc())
```
---
## Data Importing & Tidying
This is the part that can be most satisfying to automate because the manual equivalent is often very tedious, time-consuming work. We're going to program a process that will:
- iterate over each sheet in the excel file
- pull out only the data table section (starting on row 5)
- add a `year` column populated by the name of the sheet
- combine all the separate tables into one single data frame
- re-order the columns to our desired specifications
The reproducibility of this code comes from the fact that it is agnostic to the number of sheets in our excel file. That means whenever we get an updated file with a new tab of data, we just point the same code to this file and we'll get a new dataset that includes this new data. Any future computations we draw from the data such as latest period-on-period changes will automatically be derived from the latest available data that we have just received.
```{r cache=TRUE}
path <- "../../data/blog_data/gapminder_messy.xlsx"
combined_data <-
  excel_sheets(path) %>%                 # character vector of sheet names, one per year
  map_df(~ {
    read_excel(path, sheet = .x, skip = 4, trim_ws = TRUE) %>%  # data table starts on row 5
      mutate(year = as.integer(.x))      # sheet name becomes the year column
  }) %>%
  select(country, year, everything())    # re-order the columns
```
Let's quickly inspect our data to check that it got the job done...
```{r}
# top 6 rows
head(combined_data)
# bottom 6 rows
tail(combined_data)
```
Looking good. However, part of our analysis is going to involve computing continental summary statistics, and we don't currently have a continent column in our dataset. Adding one in Excel would require making our own lookup table with a corresponding continent for each unique country in the dataset, then using a `VLOOKUP` function.
In R we can easily make this part of the automated workflow using the `countrycode` package to translate from one geographical coding scheme to another - in our case getting a continent name based on a country name, like so:
```{r}
combined_data <- combined_data %>%
mutate(continent = countrycode(sourcevar = country, origin = "country.name", destination = "continent")) %>%
select(continent, everything())
head(combined_data)
```
---
## Parameterising Your Report
We now have a consolidated dataset with an added continent variable coded to each country. A common requirement in data reporting is to produce distinct reports from multiple subsets of a single data set. `rmarkdown` makes this easy: you can supply a parameter at the top of your report script, which can then be used as a variable in the code within the report to alter the outputs.
Let's say in our example we want to batch produce a report for each continent. We supply the name of the continent we want to build the report for in the `params` section of what is known as the `YAML` header at the top of the report script, like so:
```
---
title: My Report
output: html_document
params:
  continent: Europe
---
```
We can then use that parameter to filter our consolidated data set to only the countries in the chosen continent, and use that data for all subsequent analysis.
```{r}
filtered_data <- combined_data %>%
filter(continent == params$continent)
```
Once we have our full report script ready to be executed, all we have to do to get a report from one continent to the next is change the value of the continent param at the top of the script.
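As an aside, we don't even need to edit the file each time: `rmarkdown::render()` accepts a `params` argument that overrides the values in the YAML header at render time. A minimal sketch (the `report.Rmd` file name here is a placeholder):
```{r eval=FALSE}
# render the same report for Asia without touching the YAML header
rmarkdown::render("report.Rmd", params = list(continent = "Asia"))
```
We'll use this same mechanism to batch produce all the reports at the end of the post.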
But before we do that, we need to write the code for the analysis we want contained in our reports. So here are a few examples of the type of data visualisations that would help a stakeholder understand the latest metrics and insights from the data.
For the sake of brevity I'm not including all the code for the charts and tables. But like the data importing and tidying code, once it is set up and working, no amendments are required for each new wave of data we get to report on. The outputs will automatically include the latest metrics for the data we have just processed.
If you're interested in seeing the full code for this report you can check it out over on [github](https://github.com/PaulC91/cofi_website/blob/master/content/blog/2018-08-20-automated-data-reports-with-r.Rmd).
---
## Linked Interactive Graphics
In an HTML report, we can use linked interactive graphics to encourage users to engage with the data and explore the insights. Let's build some charts to show the latest period percentage changes of all 3 metrics in the data set.
First we do the data wrangling to compute the changes...
```{r}
p_change <- filtered_data %>%
  group_by(continent, country) %>%
  arrange(year) %>%
  # period-on-period % change for each metric, computed within each country
  mutate(gdpPercap_change = (gdpPercap - lag(gdpPercap)) / lag(gdpPercap),
         lifeExp_change = (lifeExp - lag(lifeExp)) / lag(lifeExp),
         pop_change = (pop - lag(pop)) / lag(pop))

# keep only the most recent period
latest_figs <- p_change %>%
  filter(year == max(year)) %>%
  select(-year) %>%
  arrange(country) %>%
  ungroup()
```
Then we can visualise the results interactively with linked highlighting across the charts.
Use the search bar to highlight a particular country across all 3 charts, or click directly on one of the bars. Double-clicking will deselect the highlighted country.
```{r linked-charts, eval=TRUE, echo=FALSE}
# create a shared data source with the crosstalk package
sd <- SharedData$new(latest_figs, ~country, group = "Highlight a Country")
# build 3 static ggplot charts with the shared data
# shared theme tweaks to hide the x axis on each chart
no_x_axis <- theme(axis.title.x = element_blank(),
                   axis.text.x = element_blank(),
                   axis.ticks.x = element_blank(),
                   panel.grid.major.x = element_blank())

sd_gdp <- ggplot(sd, aes(reorder(country, -gdpPercap_change), gdpPercap_change,
                         text = glue::glue("{country}, {continent}: {percent(gdpPercap_change)}"))) +
  geom_col(aes(fill = gdpPercap_change)) +
  scale_y_percent() +
  scale_fill_viridis_c(guide = FALSE) +
  labs(title = "GDP Per Capita % Change", y = "GDP per capita") +
  no_x_axis

sd_le <- ggplot(sd, aes(reorder(country, -lifeExp_change), lifeExp_change,
                        text = glue::glue("{country}, {continent}: {percent(lifeExp_change)}"))) +
  geom_col(aes(fill = lifeExp_change)) +
  scale_y_percent() +
  scale_fill_viridis_c(guide = FALSE) +
  labs(title = "Life Expectancy % Change", y = "Life Expectancy") +
  no_x_axis

sd_pop <- ggplot(sd, aes(reorder(country, -pop_change), pop_change,
                         text = glue::glue("{country}, {continent}: {percent(pop_change)}"))) +
  geom_col(aes(fill = pop_change)) +
  scale_y_percent() +
  scale_fill_viridis_c(guide = FALSE) +
  labs(title = "Population % Change", y = "Population") +
  no_x_axis
# make them interactive with plotly
sd_gdp_i <- ggplotly(sd_gdp, tooltip = "text")
sd_le_i <- ggplotly(sd_le, tooltip = "text")
sd_pop_i <- ggplotly(sd_pop, tooltip = "text")
# arrange them together
subplot(sd_gdp_i, sd_le_i, sd_pop_i, nrows = 3, titleY = TRUE) %>%
config(collaborate = FALSE, displaylogo = FALSE) %>%
hide_legend() %>%
highlight(selectize = TRUE, on = "plotly_click", off = "plotly_doubleclick") %>%
frameWidget(height = "600px")
```
<br>
---
## Searchable Data Tables
We can also view the same set of data in an interactive data table, with search functionality and column sorting to allow users to quickly find the numbers they are looking for. Clicking on column headers will sort the data ascending or descending.
```{r echo=FALSE}
# format population in millions for readability
mills <- scales::unit_format(scale = 1e-6, accuracy = .01, unit = "M")

# drop continent and interleave each metric with its % change column
tdat <- latest_figs %>%
  select(-continent) %>%
  select(1:2, 5, 3, 6, 4, 7) %>%
  mutate(pop = mills(pop))
DT::datatable(tdat, rownames = FALSE, caption = "Latest Period Figures (2007) & % Changes (2002-2007)",
colnames = c("Country", "GDP Per Capita", "+/-", "Life Expectancy", "+/-", "Population", "+/-"),
options = list(dom = 'ftip',
columnDefs = list(list(className = 'dt-center', targets = c(1,3,5))))
) %>%
formatCurrency(columns = 2) %>%
formatRound(columns = 4, digits = 1) %>%
formatPercentage(columns = c(3,5,7)) %>%
frameWidget(height = "500px")
```
---
## Static Charts with ggplot2
The sad truth is that most managers just really love Microsoft Office or PDF reports. Fear not though, because with `ggplot2` we have the world's best static charting library at our disposal and getting these charts in an automated Word, PowerPoint, or PDF report with `rmarkdown` is a walk in the park.
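As an illustration, switching the output format is usually a one-line change in the YAML header. A minimal sketch (these are the standard `rmarkdown` output format names, not this report's actual configuration):
```
---
title: My Report
output: word_document  # or powerpoint_presentation, pdf_document
---
```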
The chart below combines all three metrics in our data into one chart - a la Hans Rosling (RIP) - for our latest wave of data. The positions of the country annotations are computed with the [`ggrepel`](https://github.com/slowkow/ggrepel) package, which repels overlapping text labels away from each other.
```{r echo=FALSE, dpi=300}
p <- filtered_data %>%
filter(year == max(year)) %>%
ggplot(aes(gdpPercap, lifeExp, label = country)) +
geom_point(aes(size = pop), fill = "SteelBlue", colour = "white", alpha = .7, pch = 21) +
ggrepel::geom_text_repel(colour = "white", force = 50, segment.colour = "grey", size = 3) +
scale_size(range = c(2, 12)) +
labs(x = "GDP per capita", y = "Life Expectancy",
title = "Latest European Metrics", subtitle = glue("Year: {max(filtered_data$year)}"),
caption = "@paulcampbell91 | Culture of Insight | Source: Gapminder", size = "Population")
p
```
---
## Animated Charts
Although we have been using a data set filtered to our chosen continent parameter of Europe, we still have the full data set available to use. We can look at the continental summary statistics over time with the following data...
```{r}
continent_summary <-
combined_data %>%
group_by(continent, year) %>%
  summarise(gdpPercap = weighted.mean(gdpPercap, pop),  # population-weighted averages
            lifeExp = weighted.mean(lifeExp, pop),
            pop = sum(pop)) %>%
  ungroup() %>%
  gather(metric, value, 3:5)  # long format: one row per continent-year-metric
```
Then visually animate the progress over time...
```{r echo=FALSE, fig.height=7, cache=TRUE}
ggplot(continent_summary, aes(year, value, group = continent)) +
facet_wrap(~metric, ncol = 1, scales = "free_y") +
geom_line(aes(colour = continent)) +
geom_segment(aes(xend = 2007, yend = value), linetype = 2, colour = 'grey') +
geom_point(size = 2, colour = "white") +
geom_text(aes(x = 2007.1, label = continent), hjust = 0) +
transition_reveal(continent, year) +
coord_cartesian(clip = 'off') +
labs(title = "Continental Progress", y = NULL,
caption = "@paulcampbell91 | Culture of Insight | Source: Gapminder") +
theme(plot.margin = margin(5.5, 40, 5.5, 5.5), legend.position = "none")
```
All credit to [Thomas Lin Pedersen](https://twitter.com/thomasp85)'s outstanding new [`gganimate`](https://github.com/thomasp85/gganimate/) package for making charts like the above so intuitively simple to make.
---
## Fin
To sum up, we now have an `rmarkdown` script that is capable of:
1. combining data from excel tabs into a single data frame
2. filtering it by a parameter we supply
3. computing new variables + summaries of the data
4. visualising the results in interactive, static and animated charts
To be super efficient, we can batch produce all 5 reports (one for each continent) with a single function call! The code below sets the vector of parameters we want to iterate over, creates a function that calls `rmarkdown::render` to generate an html report from the script, then renders a report for each continent with `purrr::walk` (which applies the function to each item in the continents vector).
```{r eval=FALSE}
continents <- c("Asia", "Europe", "Africa", "Americas", "Oceania")
renderMyReport <- function(continent) {
rmarkdown::render("report.Rmd",
output_file = paste0(continent,"_report_", Sys.Date(), ".html"),
params = list(continent = continent),
output_options = list(self_contained = FALSE, lib_dir = "libs"))
}
purrr::walk(continents, renderMyReport)
```
You can see all the outputs of the above code in a cut-down version of this report script [over here](https://paulc91.github.io/gapminder_reports/).
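And to make good on the 'emailing your boss' promise from the intro, you could tack a final step onto the workflow that sends a rendered report automatically. Here's a minimal sketch using the [`blastula`](https://github.com/rstudio/blastula) package - my choice for illustration, not part of the report script above - with placeholder addresses and a pre-saved SMTP credentials file:
```{r eval=FALSE}
library(blastula)

# compose a short message, attach the rendered report, and send it via SMTP
compose_email(body = md("Hi boss, the latest Europe report is attached.")) %>%
  add_attachment(file = paste0("Europe_report_", Sys.Date(), ".html")) %>%
  smtp_send(to = "boss@example.com",       # placeholder address
            from = "analyst@example.com",  # placeholder address
            subject = "Automated gapminder report",
            credentials = creds_file("email_creds"))
```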
In a real-world scenario you'd most likely want to add some distinct narrative to each report, but letting R take care of all the data processing and visualisation is going to give you much more time to dig out the real insights and find a solution to all global population, wealth and health problems! `r emo::ji("sweat_smile")`
---
## Learning
Making the switch to doing your data work with R can be daunting, but with the combination of <a href="https://www.rstudio.com/products/RStudio/" target="_blank">RStudio</a> as your 'integrated development environment', the <a href="https://www.tidyverse.org/" target="_blank">tidyverse packages</a> for data analysis, and <a href="https://rmarkdown.rstudio.com/" target="_blank">rmarkdown</a> for producing reports in just about any format you'd like (HTML, PDF, Word, PowerPoint, Dashboard, even entire websites like this one!), there has never been a better time to make the leap and start your journey to better data reporting workflows. There's also a really great and welcoming community of R users online who are always happy to help new users feel at home.
If you'd like some help getting started, we're currently running a 1-day 'Master Data with Code' workshop where you'll learn how to import, manipulate, join and transform data using the `tidyverse` in the RStudio IDE.
We also offer bespoke in-house training for teams, where we focus on your own specific data tasks and how you can use R for greater accuracy and efficiency in data processing, and more engaging communication of your insights.
For more information get in touch via our [Contact Page](https://cultureofinsight.com/contact/).
Thanks for reading!