HES pipeline

Open-source R pipeline to clean and process patient-level Hospital Episode Statistics (HES) & ONS Civil Registrations data, with the aim of producing analysis-ready datasets for a defined programme of analyses.

Project Status: in progress

Project Description

Hospital Episode Statistics (HES) is a database containing details of all hospital admissions, A&E attendances and outpatient appointments at NHS hospitals in England.

Before it can be used for analysis, HES data requires cleaning (e.g. duplicate removal) and quality control, as well as additional derived variables and tables. The complexity of HES, the large number of variables and the size of the data sets can make this a challenging task.

This cleaning and processing workflow will be designed to ensure that the HES data is processed consistently and reproducibly, and that every one of our pre-approved analysis projects works with the cleaned data sets.

Data Source

We are planning to use HES data linked to Civil Registrations (deaths) covering the last 10 years, as well as quarterly data updates for the next three years. Our data application has been approved by the NHS Digital Data Access Request Service (DARS) and development of this pipeline has now commenced.

The data will be accessed in The Health Foundation's Secure Data Environment, a secure data analysis facility (accredited with the ISO27001 information security standard and recognised for the NHS Digital Data Security and Protection Toolkit). No information that could directly identify a patient or other individual will be used.

Documentation (work in progress)

The doc folder contains information on

  • the protocol used for HES data cleaning and processing [to be added]
  • the design of the pipeline (ADR)
  • the logs that are created
  • definitions of derived variables [to be updated]
  • definitions of derived tables

Sections at the end of the README describe

  • how to use the pipeline
  • how to query the resulting SQLite database

Pipeline design and features

The process file describes the overall design of the pipeline, lists the necessary inputs and gives a high-level description of the steps in the workflow.

The flowchart shows how user input and data move through the different pipeline functions.

The pipeline

  • creates a SQLite connection and database
  • reads ONS and bridge files, merges them and adds them to the database
  • reads the raw HES data files in chunks and adds them to the database (see the sketch after this list) after
    • checking that all expected columns are present
    • coercing data types (optional)
    • cleaning variables
    • deriving variables (based on a single row of data)
    • combining with public data on deprivation and CCGs (optional)
    • flagging comorbidities (to be made optional)
  • flags duplicates in the database (optional)
  • creates inpatient spells (not yet implemented)
  • creates continuous inpatient spells (not yet implemented)
  • creates summary tables for the clean dataset
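To illustrate the chunked read-and-append step, here is a minimal sketch, not the pipeline's actual code: the file and table names are assumed, and process_chunk() is a hypothetical stand-in for the column checks, cleaning and deriving steps listed above.

library(data.table)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "HES_db.sqlite")

chunk_size <- 1e6  # default: 1 million rows per chunk
header <- names(fread("APC_2019.txt", nrows = 0))      # read the header row only
n_rows <- nrow(fread("APC_2019.txt", select = 1L))     # count data rows cheaply
n_chunks <- ceiling(n_rows / chunk_size)

for (i in seq_len(n_chunks)) {
  # fread() has to skip further into the file on each pass, which is why
  # fewer, larger chunks reduce the overall processing time
  chunk <- fread("APC_2019.txt",
                 skip = (i - 1) * chunk_size + 1,      # +1 skips the header row
                 nrows = chunk_size,
                 header = FALSE, col.names = header)
  chunk <- process_chunk(chunk)  # hypothetical: clean and derive variables
  dbWriteTable(con, "APC", chunk, append = TRUE)
}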

Architecture design record (ADR)

The architecture design record (ADR) captures architectural decisions and design choices, along with their context, rationale and consequences.

So far, we have recorded decisions regarding

  • where and how the raw data is stored and, if necessary, updated
  • how the data is read in chunks and how to determine the number of required chunks per file
  • how dates will be stored in the SQLite database (see the example after this list)
  • the method chosen to compare the A&E arrival times of two records while identifying duplicate records
  • hardcoding variable names during cleaning and deriving steps (to be added)
  • the methodology used to create inpatient spells (to be added)
  • the methodology used to create continuous inpatient spells (to be added)
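Since SQLite has no native date type, a date can be stored as an integer and converted back on the way out. A minimal illustration, assuming the common convention of days since 1970-01-01 (the convention actually chosen is recorded in the ADR):

# store: convert a date to an integer (days since 1970-01-01)
adm_date <- as.integer(as.Date("2019-03-01"))   # 17956

# retrieve: convert back when reading from the database
as.Date(adm_date, origin = "1970-01-01")        # "2019-03-01"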

Derived variables

Derived variables are documented in the derived variables document.

How does it work?

As the data prepared in this pipeline is not publicly available, the code cannot be used to replicate the database. However, with modifications, the code can be used on other patient-level HES extracts to prepare the datasets for analysis. For more information on how the pipeline works, please refer to the process document.


The HES pipeline was built under R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night".

The following R packages are required to run the HES pipeline:

  • data.table (1.12.2)
  • DBI (1.0.0)
  • dplyr (0.8.3)
  • furrr (0.1.0)
  • logger (0.1)
  • plyr (1.8.4)
  • rlang (0.4.0)
  • testthat (2.2.1)
  • tidylog (0.2.0)
  • tidyverse (1.2.1)
  • comorbidity (0.5.3)
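All of these packages are available from CRAN. A convenience snippet to install them (note that install.packages() fetches the latest CRAN versions, which may be newer than those listed above):

install.packages(c("data.table", "DBI", "dplyr", "furrr", "logger", "plyr",
                   "rlang", "testthat", "tidylog", "tidyverse", "comorbidity"))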


Obtain the HES pipeline by either downloading or cloning the repo.

Usage: running the pipeline to prepare HES for analysis

The pipeline can be run in two modes:

  1. Building a HES database from scratch. This is the default.
  2. Update mode (update = TRUE). This incorporates data updates into an existing HES database.

The following is needed to run the pipeline:

  • HES data extract: The pipeline can process any of the following patient-level datasets: HES Admitted Patient Care, HES Accident & Emergency, HES Outpatient care, HES Critical Care and ONS Mortality records (including the bridge file linking them to HES). It requires at least one of them. The raw data files have to be located in the same folder. The path is provided as the data_path argument to the pipeline() function call.
  • SQLite database location (storage): This is where the SQLite database will be built. This location needs sufficient storage space to accommodate the SQLite database, roughly equivalent to the combined file size of the raw HES data extract plus twice the file size of the A&E data set (as the tables for inpatient spells and continuous inpatient spells will be added). The path to the folder is provided as the database_path argument to the pipeline() function call.
  • Dataset codes: The data_set_codes argument tells the pipeline which data sets to expect in the data_path folder. The codes can only be all or a subset of "APC", "AE", "CC" and "OP", and they also need to be present in the names of the raw files (in our experience, this is the case for raw HES files received directly from NHS Digital). ONS Mortality records are processed automatically if present in the data_path folder. The file names for mortality records and bridge files should contain "ONS" and "BF", respectively.
  • Chunk sizes (optional): Each data file is read and processed in chunks, i.e. a defined number of rows at a time. The default chunk size is 1 million lines, but it can also be set by the user. Larger chunk sizes (i.e. a lower number of chunks per file) decrease the overall processing time significantly. We think this is because, for each chunk in a given file, fread() needs progressively longer to move to the right row number to start reading data. As each dataset can have a different number of variables, and therefore requires a different amount of memory per row, chunk sizes have to be defined for each dataset separately. It is recommended to test this on a smaller subset of the data first, as very large chunk sizes can cause RStudio to crash.
  • Expected column names: The pipeline requires a csv file listing the expected column names for each data set. This table is expected to have at least two columns, colnames and dataset, similar to [csv] (see the example after this list). Column headers are automatically capitalised while the data is read in, so the column names in the csv file should be capitalised as well. This information is used to check whether each raw data file contains all expected columns.
  • Expected data types (optional): By default, the function used to read in the data automatically detects column types. The pipeline provides an option to coerce each variable into a user-defined type by setting the coerce argument to TRUE. This requires a third column, type, in the csv file mentioned above (see [csv]). Note that since SQLite does not have a date datatype, date variables need to be stored as integers and should therefore be listed as integers in the csv file.
  • Merging reference data (optional): Additional reference data that can be merged in at the record level currently include the Index of Multiple Deprivation (IMD), 2015 and/or 2019 versions, and CCG identifiers. The file paths to the reference files should be supplied as the IMD_15_csv, IMD_19_csv and CCG_xlsx arguments and will be joined on patient LSOA11. The data files can be downloaded from [to be added].
  • Flagging duplicate records (optional): This can be switched on by setting duplicate = TRUE. Warning: this will significantly increase the run time of the pipeline.
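As an illustration of the expected format of this csv file, with the type column only required when coerce = TRUE (the variable names below are examples; the real file should list the columns of your own extract):

colnames,dataset,type
EPIKEY,APC,integer
ADMIDATE,APC,integer
AEKEY,AE,integer
ARRIVALTIME,AE,character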

Currently the pipeline is designed to run in an RStudio session. From the R console, source the code:

> source("pipeline.R")

Then call pipeline(), providing as arguments a path to the data directory, a path to a directory for the SQLite database, a vector of dataset codes, a path to the csv file with expected columns (including dataset codes and data types), an optional vector of the number of rows to be read at a time per dataset and, if required, a boolean to enable coercion. The data will be processed and written to the database. N.B. This is a slow process and takes up a fair amount of memory to run.

Example run:

> pipeline(data_path = "/home/user/raw-data/",
           database_path = "/home/user/database-dir/",
           data_set_codes = c("APC", "AE", "CC", "OP"),
           chunk_sizes = c(2000000, 5000000, 2000000, 3000000),
           expected_headers_file = "/home/user/expected_columns.csv",
           IMD_15_csv = "IMD_2015_LSOA.csv",
           IMD_19_csv = "IMD_2019_LSOA.csv",
           CCG_xlsx = "xchanges-to-ccg-dco-stp-mappings-over-time.xlsx",
           coerce = TRUE, update = FALSE, duplicate = FALSE)
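To incorporate a quarterly data update into an existing database, the same call can be repeated with update = TRUE. A sketch, assuming the update files live in a separate folder and the remaining arguments match the initial build:

> pipeline(data_path = "/home/user/raw-data-update/",
           database_path = "/home/user/database-dir/",
           data_set_codes = c("APC", "AE", "CC", "OP"),
           expected_headers_file = "/home/user/expected_columns.csv",
           coerce = TRUE, update = TRUE)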

Querying the clean HES data

library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), paste0(database_path, "HES_db.sqlite"))

# List available tables
dbListTables(con)

# List available variables in the A&E table
dbListFields(con, "AE")

# Option 1: Query using dbplyr
# Select table
AE <- tbl(con, 'AE')

# Look at the first 5 rows
AE %>%
  head(5)

# Option 2: Query using SQL
dbGetQuery(con, 'SELECT * FROM AE LIMIT 5')
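dbplyr translates dplyr verbs into SQL that is executed inside the database, so only the result is pulled into R. A hedged illustration (ARRIVALAGE stands in for whichever A&E variable you want to summarise):

# Count A&E attendances by age at arrival; the count runs in SQLite
AE %>%
  count(ARRIVALAGE) %>%
  collect()

# Inspect the SQL that dbplyr generates for the query above
AE %>%
  count(ARRIVALAGE) %>%
  show_query()

# Close the connection when finished
dbDisconnect(con)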


What NOT to do (to be added).


This project is licensed under the MIT License.
