-
Notifications
You must be signed in to change notification settings - Fork 27
/
HapMap.Rmd
157 lines (135 loc) · 6.22 KB
/
HapMap.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
---
title: "Processing HapMap III reference data for ancestry estimation"
author: "Hannah Meyer"
date: "`r Sys.Date()`"
output:
pdf_document:
fig_caption: yes
toc: true
toc_depth: 2
keep_tex: true
highlight: pygments
bibliography: references.bib
csl: plos-genetics.csl
vignette: >
%\VignetteIndexEntry{HapMap}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup knitr, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
Genotype quality control for genetic association studies often includes the need
for selecting samples of the same ethnic background.
To identify individuals of divergent ancestry based on genotypes, the genotypes
of the study population can be combined with genotypes of a reference dataset
consisting of individuals from known ethnicities.
Principal component analysis (PCA) on this combined genotype panel can then be
used to detect population structure down to the level of the reference dataset.
The following vignette shows the processing steps required to use samples of the
HapMap study [@HapMap2005][@HapMap2007][@HapMap2010] as a reference dataset.
Using this reference, population structure down to large-scale continental
ancestry can be detected. A step-by-step instruction on how to conduct this
analysis is described in this
[Ancestry estimation vignette](https://meyer-lab-cshl.github.io/plinkQC/articles/AncestryCheck.html).
# Workflow
## Set-up
We will first set up some bash variables and create directories needed; storing
the names and directories of the reference will make it easy to use
updated versions of the reference in the future.
Is is also useful to keep the PLINK log-files for future reference. In order to
keep the data directory tidy, we'll create a directory for the log files and
move them to the log directory here after each analysis step.
```{bash setup, eval=FALSE}
refdir='~/reference'
mkdir -r $qcdir/plink_log
```
## Download and convert Hapmap phase III data
Hapmap phase 3 data (HapMapIII) is available in
[PLINK text format](https://www.cog-genomics.org/plink/1.9/input#ped) at ncbi.
In addition, a sample file with information about the individuals' ancestry
is available and should be downloaded as in input for `plinkQC::chec_Ancestry()`.
The following code chunk downloads and unzips the data.
```{bash download, eval=FALSE}
cd $refdir
ftp=ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2009-01_phaseIII/plink_format/
prefix=hapmap3_r2_b36_fwd.consensus.qc.poly
wget $ftp/$prefix.map.bz2
bunzip2 $prefix.map.bz2
wget $ftp/$prefix.ped.bz2
bunzip2 $prefix.per.bz2
wget $ftp/relationships_w_pops_121708.txt
```
We then convert the
[PLINK text format](https://www.cog-genomics.org/plink/1.9/input#ped) to the
standardly used
[PLINK binary format](https://www.cog-genomics.org/plink/1.9/inputbped).
```{bash convert, eval=FALSE}
plink --file $refdir/$prefix \
--make-bed \
--out $refdir/HapMapIII_NCBI36
mv $refdir/HapMapIII_NCBI36.log $refdir/log
```
## Update annotation
The genome build of HapMap III data is NCBI36. Currently most datasets are
updated to CGRCh37 or CGRCh38. In order to update the HapMap III data to the
desired build, we use the UCSC
[liftOver](https://genome.ucsc.edu/cgi-bin/hgLiftOver) tool. The liftOver tool
takes information in a format similar to the [PLINK .bim](https://www.cog-genomics.org/plink/1.9/formats#bim) format, the [UCSC bed format](https://genome.ucsc.edu/FAQ/FAQformat.html#format1) and a
[liftover chain](http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/),
containing the mapping information between the old genome (target) and new
genome (query).
It returns the updated annotation (newFile) and a file with unmappable variants
(unMapped):
```{bash liftover example, eval=FALSE}
liftOver oldFile liftover.chain newFile unMapped
```
We first need to download the liftOver tool from
https://genome.ucsc.edu/cgi-bin/hgLiftOver and the appropriate liftover chain from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/). We then convert the
[PLINK .bim](https://www.cog-genomics.org/plink/1.9/formats#bim) format, to the
zero-based [UCSC bed](https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
format.
Hapmap chromosome data is encoded numerically, with chrX represented by chr23,
and chrY as chr24. In order to match to data encoded by chrX and chrY, we will
have to rename these hapmap chromosomes. Converting to zero-based UCSC format
and re-coding chromosome codes can be achieved by:
```{bash prepare liftover, eval=FALSE}
awk '{print "chr" $1, $4 -1, $4, $2 }' $refdir/HapMapIII_NCBI36.bim | \
sed 's/chr23/chrX/' | sed 's/chr24/chrY/' > \
$refdir/HapMapIII_NCBI36.tolift
```
[Note: In the official HapMap release, chromosome codes described above, however
in the original download files (link above), no chr24 detected. I will keep this
line in for completeness, but note, when inspecting file that no chr24/chrY are
present.]
We use the liftOver tool and the UCSC bed formated annotation file together
with the appropriate chain file to do the lift over.
```{bash liftover, eval=FALSE}
liftOver $refdir/HapMapIII_NCBI36.tolift $refdir/hg18ToHg19.over.chain \
$refdir/HapMapIII_CGRCh37 $refdir/HapMapIII_NCBI36.unMapped
```
After successful liftover, we will be able to extract i) the variants that were
mappable from the old to the new genome and ii) their updated positions
```{bash extract variants, eval=FALSE}
# ectract mapped variants
awk '{print $4}' $refdir/HapMapIII_CGRCh37 > $refdir/HapMapIII_CGRCh37.snps
# ectract updated positions
awk '{print $4, $3}' $refdir/HapMapIII_CGRCh37 > $refdir/HapMapIII_CGRCh37.pos
```
## Update the reference data
We can now use PLINK to extract the mappable variants from the old build and
update their position. After these steps, the HapMap III dataset can be used
for inferring study ancestry as described in the corresponding [Ancestry estimation vignette](https://meyer-lab-cshl.github.io/plinkQC/articles/AncestryCheck.html).
```{bash update annotation, eval=FALSE}
plink --bfile $refdir/HapMapIII_NCBI36 \
--extract $refdir/HapMapIII_CGRCh37.snps \
--update-map $refdir/HapMapIII_CGRCh37.pos \
--make-bed \
--out $refdir/HapMapIII_CGRCh37
mv $refdir/HapMapIII_CGRCh37.log $refdir/log
```
# References