ukb_download_and_prep_template

This repository is no longer maintained. As a group, our workflows for preprocessing UK Biobank data now use the UK Biobank RAP. For example, see a demo workflow here.

WARNING: The history of this repository was altered on 2022-08-24. Please delete any old copies of this repository and make a fresh clone. In particular, do not submit pull requests based on the old history.

WARNING: This template was written to work with data in formats provided directly by UK Biobank via the Showcase or Data Portal (traditional access route). It is unlikely to work correctly with data downloaded from the UKB RAP (new access route). For example, this bug has been reported. It should not be used with data downloaded from the RAP without suitable amendment.

IMPORTANT: If you used or are using a version of this repo from before 2021-02-19, an error in date processing may have assigned the wrong dates to health outcomes. Please re-download and re-process any data processed with addNewHES.py.

Detailed documentation is available here.

This was the in-development version and may require major changes and corrections - use at your own risk! Please report any errors or bugs you find, preferably directly on the GitHub page or by emailing rosemary.walmsley@bdi.ox.ac.uk.

Quickstart

This usage tutorial assumes you have downloaded and extracted a .csv file containing participant data and a hesin_all.csv file with health record data from UK Biobank. The download folder contains guidance on how to download these.

1. Installation

To use this repo, run:

$ git clone https://github.com/activityMonitoring/ukb_download_and_prep_template.git

This repo requires pandas and nltk. If you are using an Anaconda installation of Python, these are included. To install Anaconda, follow the instructions at https://docs.anaconda.com/anaconda/install/linux/. Otherwise, run:

$ pip install pandas
$ pip install nltk

Navigate to the repo:

$ cd ukb_download_and_prep_template

2. Relabelling and recoding a participant data `.csv` file

You should have a ukb12345.csv participant data file which looks something like this:

eid	31-0.0	34-0.0	54-0.0	...
4987419	0	1944	11016	...
2898413	0	1956	11009	...
1049655	1	1947	11010	...
1892589	1	1941	11011	...
2449164	1	1958	11010	...

The next step towards having ready-to-use data is to filter out some columns and parse the field IDs and categorical codes.
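
To make the column naming concrete: UK Biobank headers such as 31-0.0 follow the pattern field ID, instance, array index. The sketch below shows one way such headers could be parsed and filtered by field ID. It is an illustration only and is not the code in filterUKB.py; the function names parse_ukb_column and select_columns are hypothetical.

```python
import re

# UK Biobank column headers take the form "<field>-<instance>.<array>".
UKB_COLUMN = re.compile(r"^(\d+)-(\d+)\.(\d+)$")


def parse_ukb_column(name):
    """Return (field_id, instance, array_index), or None for columns such as 'eid'."""
    match = UKB_COLUMN.match(name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def select_columns(header, wanted_field_ids):
    """Keep 'eid' plus every column whose field ID is in wanted_field_ids.

    Hypothetical helper for illustration; filterUKB.py may work differently.
    """
    kept = []
    for name in header:
        parsed = parse_ukb_column(name)
        if name == "eid" or (parsed is not None and parsed[0] in wanted_field_ids):
            kept.append(name)
    return kept


header = ["eid", "31-0.0", "34-0.0", "54-0.0"]
print(parse_ukb_column("31-0.0"))
print(select_columns(header, {31, 34}))
```

With the example header above, requesting fields 31 and 34 keeps eid, 31-0.0 and 34-0.0 and drops 54-0.0.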

Steps

Auto-generate a columns.json file from the text file of field IDs (in the format used in download_participant_data):

$ python writeColumnsFile.py --columnsFile analysisCols.txt

Run:

$ python filterUKB.py ukb12345.csv -o outputFilename.csv

3. Adding Hospital Episode Statistics on particular diseases

We now add columns on disease diagnoses in hospital. You will need:

hesin_all.csv: this is a file containing Hospital Episode Statistics data for all participants.
icdGroups.json: this is a JSON file containing descriptions of the required HES codes.
An existing dataset input.csv (which might be outputFilename.csv from the last section).
If you want to define prevalent and incident disease, input.csv should also contain a date column which will be used to define this.

Then run:

$ python addNewHES.py input.csv hesin_all.csv output.csv icdGroups.json --incident_prevalent True --date_column 'name_of_date_column'
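
As a rough illustration of what separating prevalent from incident disease means, the sketch below compares a participant's first diagnosis date with the reference date taken from the date column. It assumes the common convention that a first diagnosis on or before the reference date is prevalent and one after it is incident. This convention, the function name classify_diagnosis, and the returned labels are assumptions made for illustration; check addNewHES.py and the detailed documentation for the actual definitions and output columns.

```python
from datetime import date


def classify_diagnosis(first_diagnosis, reference_date):
    """Classify a disease as prevalent or incident relative to reference_date.

    Assumed convention (not confirmed against addNewHES.py):
    - no recorded diagnosis            -> "none"
    - diagnosis on or before reference -> "prevalent"
    - diagnosis after reference        -> "incident"
    """
    if first_diagnosis is None:
        return "none"
    if first_diagnosis <= reference_date:
        return "prevalent"
    return "incident"


# Hypothetical reference date, e.g. the value in the chosen date column.
reference = date(2014, 6, 1)
print(classify_diagnosis(date(2010, 3, 15), reference))
print(classify_diagnosis(date(2017, 9, 2), reference))
print(classify_diagnosis(None, reference))
```

Whether a diagnosis on the reference date itself counts as prevalent or incident is a design choice; the sketch treats it as prevalent.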
