-
Notifications
You must be signed in to change notification settings - Fork 27
/
AncestryCheck.Rmd
270 lines (239 loc) · 11.1 KB
/
AncestryCheck.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
---
title: "Ancestry estimation based on reference samples of known ethnicities"
author: "Hannah Meyer"
date: "`r Sys.Date()`"
output:
pdf_document:
fig_caption: yes
toc: true
toc_depth: 2
highlight: pygments
bibliography: references.bib
csl: plos-genetics.csl
vignette: >
%\VignetteIndexEntry{AncestryCheck}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup knitr, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Ancestry estimation
The identification of individuals of divergent ancestry can be achieved by
combining the genotypes of the study population with genotypes of a reference
dataset consisting of individuals from known ethnicities (for instance
individuals from the Hapmap or 1000 genomes study [@HapMap2005,@HapMap2007,
@HapMap2010,@a1000Genomes2015,@b1000Genomes2015]). Principal component
analysis (PCA) on this combined genotype panel can then be used to detect
population structure down to the level of the reference dataset (for Hapmap and
1000 Genomes, this is down to large-scale continental ancestry).
In the following, the workflow for combining a study dataset with the
reference samples, conducting PCA and estimating ancestry is demonstrated.
The study dataset consists of 200 individuals and 10,000 genetic markers and
is provided with $plinkQC$ in `file.path(find.package('plinkQC'),'extdata')`.
# Workflow
## Download reference data
A suitable reference dataset should be downloaded and if necessary, re-formated
into PLINK format. Vignettes
['Processing HapMap III reference data for ancestry estimation'](https://meyer-lab-cshl.github.io/plinkQC/articles/HapMap.html) and
['Processing 1000Genomes reference data for ancestry estimation'](https://meyer-lab-cshl.github.io/plinkQC/articles/1000Genomes.html),
show the download and processing of the HapMap phase III and 1000Genomes phase
III dataset, respectively. In this example, we will use the HapmapIII data as
the reference dataset.
## Set-up
We will first set up some bash variables and create directories needed; storing
the names and directories of the reference and study will make it easy to use
updated versions of the reference or new datasets in the future.
Is is also useful to keep the PLINK log-files for future reference. In order to
keep the data directory tidy, we'll create a directory for the log files and
move them to the log directory here after each analysis step.
```{bash setup, eval=FALSE}
qcdir='~/qcdir'
refdir='~/reference'
name='data'
refname='HapMapIII'
mkdir -p $qcdir/plink_log
```
## Match study genotypes and reference data
In order to compute joint principal components of the reference and study
population, we'll need to combine the two datasets. The plink --merge
function enables this merge, but requires the variants in the datasets to be
matching by chromosome, position and alleles. The following sections show how
to extract the relevant data from the reference and study dataset and how to
filter matching variants.
### Filter reference and study data for non A-T or G-C SNPs
We will use an awk script to find A→T and C→G SNPs. As these SNPs are more
difficult to align and only a subset of SNPs is required for the analysis, we
will remove them from both the reference and study data set.
```{bash filter at and gc snps, eval=FALSE}
awk 'BEGIN {OFS="\t"} ($5$6 == "GC" || $5$6 == "CG" \
|| $5$6 == "AT" || $5$6 == "TA") {print $2}' \
$qcdir/$name.bim > \
$qcdir/$name.ac_gt_snps
awk 'BEGIN {OFS="\t"} ($5$6 == "GC" || $5$6 == "CG" \
|| $5$6 == "AT" || $5$6 == "TA") {print $2}' \
$refdir/$refname.bim > \
$qcdir/$refname.ac_gt_snps
plink --bfile $refdir/$refname \
--exclude $qcdir/$refname.ac_gt_snps \
--make-bed \
--out $qcdir/$refname.no_ac_gt_snps
mv $qcdir/$refname.no_ac_gt_snps.log $qcdir/plink_log/$refname.no_ac_gt_snps.log
plink --bfile $qcdir/$name \
--exclude $qcdir/$name.ac_gt_snps \
--make-bed \
--out $qcdir/$name.no_ac_gt_snps
mv $qcdir/$name.no_ac_gt_snps.log $qcdir/plink_log/$name.no_ac_gt_snps.log
```
### Prune study data
We will conduct principle component analysis on genetic variants that are
pruned for variants in linkage disequilibrium (LD) with an $r^2 >0.2$ in a 50kb
window. The LD-pruned dataset is generated below, using plink --indep-pairwise
to compute the LD-variants; additionally exclude range is used to remove genomic
ranges of known high-LD structure. This file was originally provided by
@Anderson2010 and is available in
`file.path(find.package('plinkQC'),'extdata','high-LD-regions.txt')`.
```{bash prune, eval=FALSE}
plink --bfile $qcdir/$name.no_ac_gt_snps \
--exclude range $refdir/$highld \
--indep-pairwise 50 5 0.2 \
--out $qcdir/$name.no_ac_gt_snps
mv $qcdir/$name.prune.log $qcdir/plink_log/$name.prune.log
plink --bfile $qcdir/$name.no_ac_gt_snps \
--extract $qcdir/$name.no_ac_gt_snps.prune.in \
--make-bed \
--out $qcdir/$name.pruned
mv $qcdir/$name.pruned.log $qcdir/plink_log/$name.pruned.log
```
### Filter reference data for the same SNP set as in study
We will use the list of pruned variants from the study sample to reduce the
reference dataset to the size of the study samples:
```{bash filter, eval=FALSE}
plink --bfile $refdir/$refname \
--extract $qcdir/$name.prune.in \
--make-bed \
--out $qcdir/$refname.pruned
mv $qcdir/$refname.pruned.log $qcdir/plink_log/$refname.pruned.log
```
### Check and correct chromosome mismatch
The following section uses an awk-script to check that the variant IDs of the
reference data have the same chromosome ID as the study data.
For computing the genetic PC, the annotation is not important, however, merging
the files via PLINK will only work for variants with perfectly
matching attributes. For simplicity, we update the pruned reference dataset.
Note, that sex chromosomes are often encoded differently and might make the
matching more difficult. Again, for simplicity and since not crucial to the final
task, we will ignore XY-encoded sex chromosomes (via `sed -n '/^[XY]/!p'`).
```{bash chromosome mismatch, eval=FALSE}
awk 'BEGIN {OFS="\t"} FNR==NR {a[$2]=$1; next} \
($2 in a && a[$2] != $1) {print a[$2],$2}' \
$qcdir/$name.pruned.bim $qcdir/$refname.pruned.bim | \
sed -n '/^[XY]/!p' > $qcdir/$refname.toUpdateChr
plink --bfile $qcdir/$refname.pruned \
--update-chr $qcdir/$refname.toUpdateChr 1 2 \
--make-bed \
--out $qcdir/$refname.updateChr
mv $qcdir/$refname.updateChr.log $qcdir/plink_log/$refname.updateChr.log
```
### Position mismatch
Similar to the chromosome matching, we use an awk-script to find variants with
mis-matching chromosomal positions.
```{bash position mismatch, eval=FALSE}
awk 'BEGIN {OFS="\t"} FNR==NR {a[$2]=$4; next} \
($2 in a && a[$2] != $4) {print a[$2],$2}' \
$qcdir/$name.pruned.bim $qcdir/$refname.pruned.bim > \
$qcdir/${refname}.toUpdatePos
```
### Possible allele flips
Unlike chromosomal and base-pair annotation, mismatching allele-annotations will
not only prevent the plink --merge, but also mean that it is likely that
actually a different genotype was measured. Initially, we can use the following
awk-script to check if non-matching allele codes are a simple case of allele
flips.
```{bash possible allele flips, eval=FALSE}
awk 'BEGIN {OFS="\t"} FNR==NR {a[$1$2$4]=$5$6; next} \
($1$2$4 in a && a[$1$2$4] != $5$6 && a[$1$2$4] != $6$5) {print $2}' \
$qcdir/$name.pruned.bim $qcdir/$refname.pruned.bim > \
$qcdir/$refname.toFlip
```
### Upate positions and flip alleles
We use plink to update the mismatching positions and possible allele-flips
identified above.
```{bash update and flip, eval=FALSE}
plink --bfile $qcdir/$refname.updateChr \
--update-map $qcdir/$refname.toUpdatePos 1 2 \
--flip $qcdir/$refname.toFlip \
--make-bed \
--out $qcdir/$refname.flipped
mv $qcdir/$refname.flipped.log $qcdir/plink_log/$refname.flipped.log
```
### Remove mismatches
Any alleles that do not match after allele flipping, are identified and removed
from the reference dataset.
```{bash mismatch, eval=FALSE}
awk 'BEGIN {OFS="\t"} FNR==NR {a[$1$2$4]=$5$6; next} \
($1$2$4 in a && a[$1$2$4] != $5$6 && a[$1$2$4] != $6$5) {print $2}' \
$qcdir/$name.pruned.bim $qcdir/$refname.flipped.bim > \
$qcdir/$refname.mismatch
plink --bfile $qcdir/$refname.flipped \
--exclude $qcdir/$refname.mismatch \
--make-bed \
--out $qcdir/$refname.clean
mv $qcdir/$refname.clean.log $qcdir/plink_log/$refname.clean.log
```
## Merge study genotypes and reference data
The matching study and reference dataset can now be merged into a combined
dataset with plink --bmerge. If all steps outlined above were conducted
successfully, no mismatch errors should occur.
```{bash merge, eval=FALSE}
plink --bfile $qcdir/$name.pruned \
--bmerge $qcdir/$refname.clean.bed $qcdir/$refname.clean.bim \
$qcdir/$refname.clean.fam \
--make-bed \
--out $qcdir/$name.merge.$refname
mv $qcdir/$name.merge.$refname.log $qcdir/plink_log
```
## PCA on the merged data
We can now run principal component analysis on the combined dataset using
plink --pca which returns a .eigenvec file with the family and individual ID
in columns 1 and 2, followed by the first 20 principal components.
```{bash pca, eval=FALSE}
plink --bfile $qcdir/$name.merge.$refname \
--pca \
--out $qcdir/$name.$reference
mv $qcdir/$name.$reference.log $qcdir/plink_log
```
## Check ancestry
We can use the .eigenvec file to estimate the ancestry of the study samples.
Identifying individuals of divergent ancestry is implemented in
`check_ancestry`. Currently, check ancestry only supports automatic selection of
individuals of European descent. It uses principal components 1
and 2 to find the center of the known European reference samples. All study
samples whose Euclidean distance from the centre falls outside the radius
specified by the maximum Euclidean distance of the reference samples multiplied
by the chosen `europeanTh` are considered non-European. `check_ancestry` shows
the result of the ancestry analysis in a scatter plot of PC1 versus
PC2 colour-coded for samples of the reference populations and the study
population. From within R, run the following command to the ancestry check:
```{r check ancestry, eval=FALSE, fig.height=3, fig.width=5, fig.align='center'}
library(plinkQC)
indir <- system.file("extdata", package="plinkQC")
name <- 'data'
refname <- 'HapMapIII'
prefixMergedDataset <- paste(name, ".", refname, sep="")
exclude_ancestry <-
evaluate_check_ancestry(indir=indir, name=name,
prefixMergedDataset=prefixMergedDataset,
refSamplesFile=paste(indir, "/HapMap_ID2Pop.txt",
sep=""),
refColorsFile=paste(indir, "/HapMap_PopColors.txt",
sep=""),
interactive=TRUE)
```
```{r load ancestry, out.width = "500px", echo=FALSE, fig.align='center'}
knitr::include_graphics("checkAncestry.png")
```
# References