The genealogy of Arabidopsis thaliana (NSF DEB 0115062)

Ümit Seren edited this page Mar 10, 2015 · 2 revisions

Introduction

The goal of this project (a collaboration between the Bergelson, Kreitman, and Nordborg labs, as well as the [now-defunct] Genaissance Pharmaceuticals, Inc.) was to sequence roughly 1,500 short fragments in a panel of 96 lines using standard PCR-based dideoxy sequencing. The main rationale for the project was to investigate the feasibility of genome-wide association studies by describing population structure and linkage disequilibrium in A. thaliana. The project was thus analogous to the first phase of the International HapMap Project. This page summarizes the main results from the study.

The data

The panel of 96 accessions is available from the stock centers (CS22564-CS22659), and the 1,214 annotated sequence alignments generated by the project can be downloaded here. The old website generated for this project was never very useful and is now broken. We have plans to develop a much improved database that would ultimately contain all the polymorphism data currently being generated plus data for related species like A. lyrata. The bulk of the SNPs have also been submitted to TAIR. The data have proved extremely useful in many ways, some anticipated, some not.

High-quality sequence data

Our manually curated, high-quality dideoxy sequencing data played a very important role as quality control in analyzing the (much noisier) Perlegen re-sequencing data. Because a subset of the 96 accessions was used in the Perlegen study, it was possible to calibrate the base-calling algorithms very accurately (Clark et al., 2007).

High-quality SNPs

Even though the number of SNPs generated by this project is dwarfed by the recently generated Perlegen data, the SNPs are nonetheless sufficiently dense for many uses, in particular linkage mapping. Several studies from multiple labs (Borevitz, Koornneef, Weigel, etc.) utilizing markers from this study are underway, and several groups have developed software to select markers for particular crosses (e.g., MSQT, MarkerTracker).

Population structure

Contrary to early studies, the data revealed clear population structure and isolation by distance on a global scale (Nordborg et al., 2005). There was tremendous variation among regions in the amount of local population structure. For example, while populations in northern Sweden seemed to be quite distinct, populations in most other regions appeared to be mixing much more freely. North American populations showed all signs of having been recently introduced from Europe via a small number of founders. Finally, in spite of being highly selfing, A. thaliana is far from a collection of isolated lineages. Most alleles were shared world-wide, and there was often considerable variation even within local patches (Bakker et al., 2006). Recombination was evident on all scales.

Detecting selection

The data have proven to be valuable as a form of genomic control when testing for selection. By comparing the pattern of polymorphism at particular loci suspected of having been subject to selection with our genome-wide data, it is possible to establish rigorously that the former are, in a genomic sense, unusual. Using this approach, Toomajian et al. (2006) established that early-flowering alleles of the vernalization response locus FRI have been affected by a recent selective sweep, and Bakker et al. (2006) demonstrated that R genes, as a class, have been affected by some form of balancing selection.
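The logic of using genome-wide data as an empirical control can be sketched in a few lines. The example below is only an illustration of the general idea, not the procedure used in the cited papers: it asks what fraction of background loci have a summary statistic at least as extreme as a candidate locus. The function name `empirical_tail_fraction`, the choice of statistic, and all numbers are hypothetical.

```python
from bisect import bisect_left


def empirical_tail_fraction(genome_wide_values, candidate_value):
    """Return the fraction of background loci whose statistic is >= the candidate's.

    A small fraction suggests the candidate locus is unusual relative to the
    genome-wide (empirical) distribution. No population-genetic model is assumed.
    """
    if not genome_wide_values:
        raise ValueError("need at least one background locus")
    ordered = sorted(genome_wide_values)
    # Index of the first background value that is >= candidate_value.
    first_at_least = bisect_left(ordered, candidate_value)
    return (len(ordered) - first_at_least) / len(ordered)


# Made-up per-locus statistics (NOT project data), for illustration only.
background = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 0.95, 1.05, 2.4]
candidate = 2.0  # hypothetical value observed at a locus of interest

print(empirical_tail_fraction(background, candidate))
```

With real data the background would be the full set of sequenced fragments, and the statistic would be chosen to match the form of selection being tested.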

Association mapping

These successes notwithstanding, the project was somewhat disappointing in that we found that linkage disequilibrium decayed much faster than previous results (Nordborg et al., 2002) had led us to believe: within 25 kb rather than within 250 kb (Nordborg et al., 2005). This meant that the marker density in the study, roughly one sequenced locus every 100 kb, was not sufficient to describe the genome-wide structure of linkage disequilibrium or to carry out genome-wide association mapping (Aranzana et al., 2005). The data were sufficient for exploring the feasibility of genome-wide association mapping, however. We have in particular focused on the problem of confounding by population structure: spurious genotype-phenotype correlations that arise simply because both genotypes and phenotypes are correlated with underlying structure. In a series of papers, we have demonstrated that this problem can be very serious, but that reasonably effective statistical remedies exist (Aranzana et al., 2005; Zhao et al., 2007).
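For readers unfamiliar with how linkage disequilibrium between two SNPs is quantified, here is a minimal sketch of the commonly used r² statistic. It assumes each line is effectively homozygous (plausible for a highly selfing species), so that every accession contributes a single haplotype with alleles coded 0 or 1. The function name and the toy genotypes are hypothetical, and the cited papers may have used different measures.

```python
def r_squared(locus_a, locus_b):
    """Compute r^2 between two biallelic loci scored 0/1 in the same lines.

    Assumes one haplotype per line (inbred lines), so haplotype frequencies
    can be counted directly without phasing.
    """
    if len(locus_a) != len(locus_b) or not locus_a:
        raise ValueError("loci must be non-empty and scored in the same lines")
    n = len(locus_a)
    p_a = sum(locus_a) / n                                   # frequency of allele 1 at locus A
    p_b = sum(locus_b) / n                                   # frequency of allele 1 at locus B
    p_ab = sum(a * b for a, b in zip(locus_a, locus_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                     # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Toy genotypes for four lines (not project data).
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # perfectly associated loci
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # independent loci
```

Plotting r² for all pairs of SNPs against the physical distance between them gives the kind of decay curve from which a figure such as "within 25 kb" is read off.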

In the long run, the fact that linkage disequilibrium decays more rapidly than originally believed is of course excellent news, because it means that association mapping will have higher resolution. The continuation of this study was designed to have a much higher marker density (250,000 SNPs, or one SNP every 500 bp), and also a much larger sample (over 1,000 lines) that includes more homogeneous regional samples to help overcome confounding by population structure. Meanwhile, the original sample of 96 lines is being phenotyped by a large number of labs, if for no other reason than to establish a baseline for variability in a given trait.
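The marker-density figures quoted on this page can be checked with simple arithmetic. The sketch below uses only numbers stated here (a 130 million bp genome from the original abstract, 1,214 alignments, 250,000 SNPs, and LD decaying within 25 kb). Comparing mean spacing with the LD decay distance is a deliberately crude rule of thumb, and the helper names are hypothetical.

```python
GENOME_SIZE_BP = 130_000_000  # genome size as quoted in the original abstract
LD_DECAY_BP = 25_000          # LD decays within roughly 25 kb (Nordborg et al., 2005)


def mean_spacing_bp(genome_size_bp, n_markers):
    """Average distance between markers if they were spread evenly over the genome."""
    return genome_size_bp / n_markers


def spacing_within_ld(spacing_bp, ld_decay_bp):
    """Crude check: is the average marker spacing shorter than the LD decay distance?"""
    return spacing_bp < ld_decay_bp


original = mean_spacing_bp(GENOME_SIZE_BP, 1_214)     # sequenced alignments in this project
follow_up = mean_spacing_bp(GENOME_SIZE_BP, 250_000)  # SNPs in the continuation study

print(round(original), spacing_within_ld(original, LD_DECAY_BP))
print(round(follow_up), spacing_within_ld(follow_up, LD_DECAY_BP))
```

The first spacing comes out at about 107 kb, in line with "roughly one sequenced locus every 100 kb" and well above the 25 kb decay distance. The second is 520 bp, in line with "one SNP every 500 bp".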

Original project abstract

The entire 130 million base pair genome of the plant Arabidopsis thaliana was finished last year. The objective of this project is to leverage the genome sequence to catalog the naturally occurring genetic variation in the species. The project is based on the theoretical insight that, in highly self-fertilizing organisms, like A. thaliana, it should be possible to create such a catalog very efficiently by looking at the pattern of variation in a number of small segments distributed over the genome. Rather than sequencing the entire genome of one additional individual, one should sequence 1% of the genome in 100 individuals. Specifically, the project will sequence 1500–2000 chromosomal segments of length 500–700 base pairs, distributed over the genome, in a sample of 96 carefully selected individuals. The data will be publicly available through GenBank, as well as through a highly flexible relational database developed specifically for this purpose. The database will be equipped with web-based bioinformatics tools to query it, and will be continuously updated. The project represents the first serious attempt to describe the genomic variation in a species. It is highly relevant to the objectives of the 2010 project in a general sense, because it will not be possible to "determine the function of all genes [...] within their cellular, organismal, and evolutionary contexts" without understanding how genetic variation is structured in the species. More immediately, the database will be an invaluable resource for plant geneticists interested in finding the genes responsible for variation in agriculturally important traits such as drought tolerance. In this respect, the project should be compared to the large databases of human variation that are currently being created to aid genetic epidemiology. The tools and methods created for this project will also be directly applicable to several organisms of direct economic importance, such as rice and barley. 
Finally, the database will serve as a very important training tool for students in computational and evolutionary biology, and in statistical genetics.