Merge overleaf-2022-05-18-1558 into main

DivyaratanPopli · May 18, 2022 · b3823c0
2 parents 69391e0 + aa7ba23
Showing 2 changed files with 50 additions and 4 deletions.
diff --git a/paper_overleaf/main.tex b/paper_overleaf/main.tex
@@ -53,7 +53,7 @@ \section{Introduction}
 
 \subsection{Why study relatedness?}
 
-Identifying related individuals is a common task in genetic studies. Relatedness is of direct interest in e.g. DNA forensics, where familial search can aid in solving criminal cases, and to identify unknown deceased persons \cite{murphy_law_2018,ram_genealogy_2018}. Genetic paternity tests have an important application in resolving family relation, e.g. in establishing relationship between a person applying for immigration and the claimed relatives \cite{egeland_beyond_2000}. It is also an essential step in population genetics and association studies, where samples are typically assumed to be independent random draws from the population. For animal and plant breeders and conservation biologists, reconstructing pedigrees and finding related individuals is important to avoid inbreeding and ensure diversity \cite{habier_impact_2007,oliehoek_estimating_2006,kardos_measuring_2015}.
+Identifying related individuals is a common task in genetic studies. Relatedness is of direct interest in, e.g., DNA forensics, where familial searches can aid in solving criminal cases and in identifying unknown deceased persons \cite{murphy_law_2018,ram_genealogy_2018}. Genetic paternity tests have an important application in resolving family relationships, e.g. in establishing the relationship between an individual applying for immigration and the claimed relatives \cite{egeland_beyond_2000}. Identifying relatives is also an essential step in population genetics and association studies, where samples are typically assumed to be independent random draws from the population. For animal and plant breeders and conservation biologists, reconstructing pedigrees and finding related individuals is important to avoid inbreeding and ensure diversity \cite{habier_impact_2007,oliehoek_estimating_2006,kardos_measuring_2015}.
 
 In ancient DNA studies, relatedness can be used to identify bones and teeth belonging to the same individual. Given adequate familiarity with the subject, relatedness can provide an understanding of an ancient society's social structures, mobility and inheritance rules~\cite{baca_ancient_2012,mittnik_kinship-based_2019,sikora_ancient_2017}.
 
@@ -62,7 +62,7 @@ \subsection{Approaches to estimate relatedness from high-coverage data}
 
 Commonly, pairs of related individuals are identified by looking for parts of the genome that are identical by descent (IBD), i.e. inherited from a recent common ancestor. Due to the laws of Mendelian segregation, each parent will share exactly one set of chromosomes IBD with their offspring, while subsequent recombination means that a grandparent will, on average, share a quarter of their genome with a grandchild. Along the genomes of a pair of diploid individuals, there are three IBD states possible at any given position: the individuals share zero, one or two chromosomes IBD. The genome-wide proportions of these states (usually referred to as $k_0$, $k_1$, $k_2$, so that $k_0+k_1+k_2=1$) can be used to infer the degree and nature of relatedness for a pair of individuals. For example, a pair of siblings is expected to have all three possible IBD states, with proportions of 0.25, 0.5 and 0.25, respectively (Fig.~\ref{fig0:schematic}). These IBD probabilities can be used directly to categorize their relatedness, as shown in Table~\ref{tab:Table 1}. One can also use these probabilities to estimate the coefficient of relatedness $r$, which is defined as the proportion of the genome that is IBD. In the absence of inbreeding, this is calculated as $r = k_1/2 + k_2$.
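 
 % Editorial worked example, not part of the commit; the parent-child and grandparent-grandchild proportions are assumed textbook values for outbred pairs.
 As a quick check of the relation $r = k_1/2 + k_2$ given above, the following worked values may be helpful. Only the sibling proportions are stated in the text; the other two rows use the standard expected proportions for outbred pairs and are included purely for illustration.
 \[
 \begin{array}{lll}
 \mbox{Relationship} & (k_0, k_1, k_2) & r = k_1/2 + k_2 \\
 \mbox{Parent-child} & (0,\ 1,\ 0) & 1/2 + 0 = 0.5 \\
 \mbox{Full siblings} & (0.25,\ 0.5,\ 0.25) & 0.5/2 + 0.25 = 0.5 \\
 \mbox{Grandparent-grandchild} & (0.5,\ 0.5,\ 0) & 0.5/2 + 0 = 0.25
 \end{array}
 \]
 Under these assumptions, parent-child and sibling pairs have the same $r$ but differ in $k_0$ and $k_2$, which suggests why the full set of proportions, rather than $r$ alone, is needed to distinguish the nature of a relationship.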
 
-However, since it is not possible to directly observe IBD segments, a common approach is to first identify segments of the genome that are Identical by State (IBS) and to use population allele frequencies obtained from an out-of-sample reference panel to calculate the probability of IBD given IBS. There are several methods that incproporate reference panel allele frequencies, phase information, recombination maps, or genotype calls to co-estimate IBD and the relatedness coefficient \cite{huff_maximum-likelihood_2011,li_relationship_2014,thornton_estimating_2012, boehnke_accurate_1997,lynch_estimation_1999, albrechtsen_natural_2010, purcell_plink_2007,manichaikul_robust_2010,gusev_whole_2009,nyerki_optimized_2022}.
+However, since it is not possible to directly observe IBD segments, a common approach is to first identify segments of the genome that are identical by state (IBS) and to use population allele frequencies obtained from an out-of-sample reference panel to calculate the probability of IBD given IBS. There are several methods that incorporate reference panel allele frequencies, phase information, recombination maps, or genotype calls to co-estimate IBD and the relatedness coefficient \cite{huff_maximum-likelihood_2011,li_relationship_2014,thornton_estimating_2012, boehnke_accurate_1997,lynch_estimation_1999, purcell_plink_2007,manichaikul_robust_2010,gusev_whole_2009,nyerki_optimized_2022,browning_fast_2011,li_accurate_2014}.
 
 \begin{figure}[h!]
     \includegraphics[width=18cm]{plots/inkscape_finalImg/schematic_sib.png}
@@ -404,7 +404,7 @@ \section{Discussion}\label{discussion}
 
 The Lech Valley data have low contamination and no ROH. For pairwise comparisons with large numbers of overlapping sites ($>10000$), KIN, READ and lcMLkin mostly agree. However, KIN is able to differentiate between parent-child and sibling pairs, and to identify second-degree relationships from just a few thousand polymorphic sites ($\approx$ 4000) overlapping between samples. KIN can also infer third-degree relationships with $\approx$ 30,000 overlapping polymorphic sites. We show that when applied to Neandertal specimens from Chagyrskaya and Okladnikov Caves, KIN identifies a pair of $1^{st}$-degree relatives as parent-child, which is in agreement with the finding that the mtDNA haplotypes differ between the samples \cite{laurits_skov_genetic_nodate}. In addition, KIN identifies a pair of $3^{rd}$-degree relatives. In this case of a population with large amounts of ROH, we find that the inferences by lcMLkin are heavily biased, but KIN's model takes ROH into account, and both the coefficient of relatedness and $k_0$ are very close to what would be expected from the inference by both READ and KIN.
 
-One limitation of our approach is that it assumes a single population. In case of a highly structured population, KIN may show inaccurate inference of $p_0$ causing inaccurate relatedness inference. Also, our method makes the assumption, that the median pairwise genetic difference in the population reflects the population diversity $p_0$, which fails if almost all individuals in the dataset are related. The current implementation of KIN is restricted to six relatedness cases we expect to be most common, but it might be feasible to extend it to other cases, such as double first cousins, using a corresponding IBD state transition matrix.
+One limitation of our approach is that it assumes a single population. In the case of a highly structured population, KIN may infer $p_0$ inaccurately, leading to inaccurate relatedness inference. Our method also assumes that the median pairwise genetic difference in the population reflects the population diversity $p_0$, which fails if almost all individuals in the dataset are related. We may get around this problem by using an estimate of $p_0$ calculated from a known pair of unrelated individuals from the same population, or from another population with similar diversity. The current implementation of KIN is restricted to six relatedness cases we expect to be most common, but it might be feasible to extend it to other cases, such as double first cousins, using a corresponding IBD state transition matrix.
 
 While we have focused on the application of KIN to ancient human samples, the model is not tied to this system. Assuming we know the recombination rate, and hence can estimate the transition matrix (see section \ref{method}), KIN can be applied to any diploid species. In addition, the output of KIN is a table that shows, for each pair, the most likely model and the second-best guess, along with a confidence level represented by the log likelihood ratio. This makes KIN easy to automate for large datasets. To make KIN user-friendly, we provide a Python package (KINgaroo) to create input files for KIN from processed BAM files, while optionally estimating ROH and correcting for contamination.
 
@@ -463,9 +463,20 @@ \section{Acknowledgements}
 \section{Contributions}
 Conceptualization (Design of study): B.M.P.; Software: D.P.; Methodology—lead: D.P.; Methodology—support: B.M.P., S.P.; Formal Analysis: D.P.; Visualization-lead: D.P.; Visualization-support: S.P.; Data Curation: D.P.; Writing—lead: D.P.; Writing—support: B.M.P., S.P.; Supervision: B.M.P.
 
-\section{Competing interests}
+\section{Data and material availability}
+An open-source implementation of KIN and KINgaroo in Python, along with a toy example dataset and the scripts to generate our test simulations, is available on GitHub (\url{https://github.com/DivyaratanPopli/Kinship_Inference}). We have deposited the version of the software used in the manuscript on Zenodo (). The analysed datasets from the Bronze Age Lech Valley were generated in a previous study \cite{mittnik_kinship-based_2019}. The Chagyrskaya and Okladnikov datasets were generated in a study currently in press \cite{laurits_skov_genetic_nodate}, and the data will be uploaded to the European Nucleotide Archive upon publication.
+
+\section{Ethics declarations}
+\subsection{Ethics approval and consent to participate}
+Not applicable.
+
+\subsection{Consent for publication}
+Not applicable.
+
+\subsection{Competing interests}
 The authors declare that they have no competing interests.
 
+
 \bibliographystyle{plain}
 \bibliography{references.bib}
 

diff --git a/paper_overleaf/references.bib b/paper_overleaf/references.bib
@@ -1,4 +1,39 @@
 
+@article{browning_fast_2011,
+	title = {A {Fast}, {Powerful} {Method} for {Detecting} {Identity} by {Descent}},
+	volume = {88},
+	issn = {0002-9297},
+	url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035716/},
+	doi = {10.1016/j.ajhg.2011.01.010},
+	abstract = {We present a method, fastIBD, for finding tracts of identity by descent (IBD) between pairs of individuals. FastIBD can be applied to thousands of samples across genome-wide SNP data and is significantly more powerful for finding short tracts of IBD than existing methods for finding IBD tracts in such data. We show that fastIBD can detect facets of population structure that are not revealed by other methods. In the Wellcome Trust Case Control Consortium bipolar disorder case-control data, we find a genome-wide excess of IBD in case-case pairs of individuals compared to control-control pairs. We show that this excess can be explained by the geographical clustering of cases. We also show that it is possible to use fastIBD to generate highly accurate estimates of genome-wide IBD sharing between pairs of distant relatives. This is useful for estimation of relationship and for adjusting for relatedness in association studies. FastIBD is incorporated in the freely available Beagle software package.},
+	number = {2},
+	urldate = {2022-05-17},
+	journal = {American Journal of Human Genetics},
+	author = {Browning, Brian L. and Browning, Sharon R.},
+	month = feb,
+	year = {2011},
+	pmid = {21310274},
+	pmcid = {PMC3035716},
+	pages = {173--182},
+}
+
+@article{albrechtsen_relatedness_2009,
+	title = {Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium},
+	volume = {33},
+	issn = {1098-2272},
+	doi = {10.1002/gepi.20378},
+	abstract = {Estimates of relatedness have several applications such as the identification of relatives or in identifying disease related genes through identity by descent (IBD) mapping. Here we present a new method for identifying IBD tracts among individuals from genome-wide single nucleotide polymorphisms data. We use a continuous time Markov model where the hidden states are the number of alleles shared IBD between pairs of individuals at a given position. In contrast to previous methods, our method accurately accounts for linkage disequilibrium using pairwise haplotype probabilities. The method provides a map of the local relatedness along the genome. We illustrate the potential of the method for mapping disease genes on a real data set, and show that the method has the potential to map causative disease mutations using only a handful of affected individuals. The new IBD mapping method provides considerable improvement in mapping power in natural populations compared to standard association mapping methods.},
+	language = {eng},
+	number = {3},
+	journal = {Genetic Epidemiology},
+	author = {Albrechtsen, Anders and Sand Korneliussen, Thorfinn and Moltke, Ida and van Overseem Hansen, Thomas and Nielsen, Finn Cilius and Nielsen, Rasmus},
+	month = apr,
+	year = {2009},
+	pmid = {19025785},
+	keywords = {Chromosome Mapping, Genome-Wide Association Study, Humans, Linkage Disequilibrium, Markov Chains, Polymorphism, Single Nucleotide},
+	pages = {266--274},
+}
+
 @article{green_draft_2010,
 	title = {A {Draft} {Sequence} of the {Neandertal} {Genome}},
 	volume = {328},