Skip to content

Population genomics of european hedgehogs across contact zones.

License

Notifications You must be signed in to change notification settings

IgnasiLucas/hedgehog

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

2021-12-21

Reproduced with NGSadmix the distribution of ancestries with K=6 obtained before with Admixture.

See the report here

2021-12-08

I run PCAngsd and NGSadmix. PCAngsd separates E. europaeus from E. roumanicus in the first component, which explains almost 90% of the variation. The second component differentiates E. concolor from E. roumanicus. The original misassignment of two E. roumanicus individuals to the E. concolor population is evident, and requires the exclusion of those samples from analysis, because we miss geographic data from them. The two E. amurensis cluster together, well separated from the other species.

Different runs give slightly different results in NGSadmix. With K=6 I can reproduce the separation of species, the distinction of two lineages in E. europaeus and in E. roumanicus, and the presence of hybrids or introgressed individuals. Results do not seem stable, though, with particular individuals apparently assigned to unexpected groups. It may require some more attention.

See the report here

2021-12-06

First try using ANGSD. It's terribly badly documented, but the methods implemented are very good. I managed to explore the allele frequencies and SFS. A dilema: whether to use SNPs pre-filtered in VCFs or to use bam files directly.

See the report here

2021-12-01

Using only forward reads does produce a more complete genotypes matrix, but in the end the number of sites retained is larger when using both forward and reverse reads. Thus, I stick to the original mapping of reads.

Here I also revise the strategy for filtering individuals. I lost track of why individual samples were removed from the data set in the last filtering, when a file 'popmap_erca_only_good.txt' was used to keep only 75 out of 90 individuals. It turns out that a few of the individuals removed back then do not really have low numbers of SNPs covered. Here I apply a different filtering strategy and remove the 15 samples with strickly lower number of SNPs covered at least 8 times in sites with at least 75% of samples genotyped. I end up retaining a larger fraction of sites, and decide to move on with this filtering.

I produce two VCF with the best 75 samples, one with 175735 sites with at least 75% of samples covered at least 6 times, and one with 29252 sites with at least 75% of samples covered at least 8 times.

See the report here

2021-11-22

I run again freebayes using only forward reads from all samples, with the hope that all DNA fragments are oriented. The RADseq protocol does produce asymmetric fragments, with one end delimited by the enzyme recognition site, and the other by the random fragmentation of the DNA. Thus, reads produced from the enzyme recognition site (forward reads, I think) must be alignable among them, across all samples. Reverse reads, on the contrary, are expected to map at variable distances from the cut site, and do not form uniform clusters. This must be a source of SNPs that are not shared among samples. Plus, the thinning algorithm could be wasting some good SNPs in forward reads, while casually picking up SNPs from nearby reverse reads. Therefore, it is necessary to run freebayes again, but only with forward reads.

See the report here

2021-11-17

While trying to understand why we are left with so few loci covered in a large enough subset of samples, I realize that samples sequenced in the second batch are the ones missing most genotypes. The reason must be that those were sequenced with only single reads.

See the report here

2021-11-16

Started working to address the reviewers' comments. I realize that to use ANGSD, which uses genotype likelihoods it may be important to estimate the likelihoods correctly, by informing freebayes of the morphological classification of samples. Thus, I run freebayes again.

See the report here

2020-12-11

Used a new script, freq3.awk, to estimate derived allele frequencies in populations without assuming a particular phylogeny among them. I think the goal in that folder was to explore the jackknive approach to D statistics, but I got distracted and I left it there.

2020-09-28

Obtaining D and f statistics from systematic comparisons among populations.

2020-07-22

Another attempt to prepare the data for submission, dividing it up in more files, to respect the file size limit in Dyrad.

2020-07-19

Preparation of metadata for submission to a public repository.

2020-07-14

Preparation of data for submission to a public repository.

2020-06-26

Just another run of Bayescan to confirm that it overestimates the Fst values. It may be due to low number of individuals and populations.

2020-06-15

The unexpected relationships between diversit and divergence were originally observed using data from quite large sliding windows. The traditional MKA test assumes recombination did not happen within the alignment where diversity and divergence are estimated. Here I analyse thousands of short alignments from mapped reads, an I use only divergence and diversity estimates from within those alignments. Visualizing the pattern and testing for unexpected deviations is still challenging, mainly because of the short range of values and their very skewed distribution.

2020-05-06

Checking that the BayeScan run was correct. It was indeed. The only little problem was in a script that produced an unnecessary plot. Nothing remarkable. See report here

2020-04-09

Another potential source of bias is the reference genome: where divergence is higher, coverage in the non-reference species is expected to be lower, due to reads that fail to map. After taking depth of coverage into account, the non-neutral relationships persist. Thus, reference bias is not a spurious cause of the unexpected relationships.

2020-04-08

The unexpected non-neutral pattern between diversity in either E. europaeus or E. concolor and their divergence triggered the alarm for artifacts. Here I check two potential sources of bias: the genomic window definition, and the unequal sample sizes among populations. Neither of them is causing the unexpected pattern. I keep searching.

2020-04-07

The genome-wide negative correlation between genetic diversity in E. roumanicus and its level of divergence from E. europaeus demanded more attention. Here I include the third species, E. concolor, in the analysis. As expected under neutrality, within species diversity is positively correlated with divergence between E. roumanicus and E. concolor. This would support the idea that the negative correlation observed when comparing E. roumanicus with E. europaeus is due to introgression. However, the comparison between E. europaeus and E. concolor is not neutral (absence of correlation) and cannot be explained by introgression.

2020-02-18

Upon Kristyna's suggestion, I check if bin/vcf2MK_2.awk is counting polymorphic and divergent synonymous and missense sites correctly, for the MK test. It turns out that the script was not reporting any synonymous polymorphic site at either the lowest or the highest frequency classes. The bug was unlikely to bias the asymptotic estimate of alpha, but was corrected. Most asymptotic estimates of alpha are negative but noisy, with 95% CI always overlapping 0. Estimates in E. roumanicus are higher.

2020-02-12

New data came from the second contact zone in Russia. I run some checks to supervise the results from Kristyna's pipeline. The number of ambiguous mappings cannot be reduced much by changing the "very-sensitive" setting to just "sensitive". And the filtering step can hardly be improved either. See the report here.

2019-03-04

Here, I modify the script bin/vcf2MK_2.awk to include outgroup information and determine what is the ancestral allele. Results are comparable to those in 2019-03-01. Later, I tried to optimize the number of sites counted by requiring no more than one genotyped individual from the sister species (to determine divergence). But the effort did not improve much the results. The asymptotic estimates of alpha are calculated locally with the R script provided by Messer and Petrov (2013). See the results here

2019-03-01

Once the variants are annotated, I need to extract the information I need to run the McDonald-Kreitman test, namely, levels of synonymous and non-synonymous polymorphism and fixed differences. For this purpose, I wrote the script vcf2MK.awk, in folder bin/, which takes advantage of the binary presence flag and the annotation flag in the vcf. The results here are not quite right, because I did not identify the derived allele.

2019-02-27

The analyses from 2018-10-29 revealed a negative relationship between divergence (from E. europaeus) and diversity in E. roumanicus, using window-based estimates. Since one possible explanation is that positive selection in E. roumanicus drove divergence up and diversity down in some regions of the genome, we decided to run a genome-wide McDonald-Kreitman test of neutrality in E. roumanicus. For that, we need to annotate the variants with their functional consequence, including if they cause synonymous or non-synonymous changes in protein-coding genes. In this folder, I use the program snpEff to annotate all variants.

2019-01-30

Along a discussion about future sequencing strategies, the question of whether paired end reads are worth came up. In principle, paired-end reads improve mappability, for they extend the total length of mappable sequence. The paired end option is 2 * 100 base pairs, while single read options are either 100 or 150 bp long. In this folder I compare the uniqueness of 100, 150 and 200-bp long words in the hedgehog's reference genome. The improvement in uniqueness (mappability) between 150 and 200 bp words is very marginal. This convinces me that single end reads are sufficient.

2018-12-15

It was clear from previous analyses that introgressed variation is not distributed randomly along the genome in E. roumanicus. In order to identify introgressed variants that have a higher (or lower) chance of lingering, Barbora suggested to estimate genomic clines, using the Bayesian approach implemented in bgc by Gompert and Buerkle (2012). For that, I first identify six individuals with varying degrees of admixture between E. roumanicus and E. europaeus. Then, I run bgc using those six individuals as the admixed population, and the rest as parental populations of either species. Unfortunately, the admixed population is too small to notice any deviation of introgressed allele frequencies with respect to the expectation of the genome-wide hybrid index.

2018-10-29

Here I use Simon H. Martin's scripts to calculate population genetics statistics in sliding windows. I also use the annotation of the reference genome to estimate those statistics separately in genic and intergenic regions. I create some interesting plots and found an unexpected negative relationship between diversity in E. roumanicus and divergence between E. roumanicus and E. europaeus.

2018-10-24

Here I repeat the analysis from 2018-10-18, but with a different admixed individual, Er55_AU7, which has only a small percentage of E. europaeus ancestry. Thus, its specific combination of admixed and non-admixed genomic regions has actually survived more generations and is more likely to be the result of natural selection.

Statistic Admixed Not_admixed Difference p-value
D 0.372100543001792 0.133008162873339 0.239092 0.06178
f_hom 0.106746542696069 0.0337681212184294 0.0729784 0.02096
f_d 0.0593813125368232 0.0198639588674299 0.0395174 0.05498
f 0.128633212368533 0.0388844749086059 0.0897487 0.02396

2018-10-18

I compare the D and f statistics between genomic regions where the hybrid individual, Er37_SK27, either has or does not have mixed ancestries. Both D and f statistics are slightly higher where Er37_SK27 is admixed, even if Er37_SK27 itself is not used in the computation of the statistics. However, differences are not significant, as assessed by 5000 random permutations of the genomic regions. Here is the summary:

Statistic Admixed regions Not admixed Difference p-value
D 0.159047740296386 0.133996097982671 0.0250516 0.2678
f_hom 0.0398418118213496 0.0349892634711391 0.00485255 0.3332
f_d 0.0235819111306138 0.020606506731098 0.002975 0.3432
f 0.0452036938062747 0.0410208465643631 0.00418285 0.3542

2018-10-15

I run the abba/baba test and the estimation of the proportion of the genome that is introgressed. I use as input the most recent filtering of the vcf file, including data from at least one individual of each population. I also split the E. europaeus spcies in two populations, according to Admixture (and guessing the origin of two individuals included in this analysis that were absent in the Admixture one). I think I mostly confirm Kristýna's results. The D statistic, measuring the excess of the ABBA pattern over the BABA one, is positive (0.15) and highly significant, meaning that there must have been introgression between E. roumanicus and E. europaeus. The signal is still strong when using only either the eastern or the western population of E. europaeus. The proportion of the genome that is introgressed is low (below 4%, and probably below 1%), and the estimate depends on the specific statistic used.

2018-10-08

Simple update of the R script that creates the graphs from the Admixture output. The goal was simply to order individuals in a reasonable way.

2018-10-03

In the past (2018-06-30) I have used my own script to identify genomic regions of admixture in individuals known to be admixed. However, my script did not take recombination into account, as if all loci were independent. There exist several programs that take recombination into account. The oldest of them do not require phased haplotypes. I was unable to install SABER. I use LAMP instead. Some contigs did not produce results (segmentation fault), but most did. I run LAMP with several values for the time-since-admixture parameter, because for 3 of the 4 putatively admixed individuals we don't know when admixture happened. Results are quite consistent for a large range of those values. LAMP runs separately for each contig and produces an image with ancestries represented as colors along a chromosome. I joined the images corresponding to the 50 largest contigs to visualize the most important results.

2018-09-26

To address the problem detected on 2018-08-01, namely, the low accuracy of allele frequency estimates in E. concolor population, I re-do the filtering of the vcf file. Using our own scripts, we require either 5 or 4 E. concolor individuals to have data. This reduces the number of sites to 43301 or 135043, respectively, when also requiring at least 10 genotypes from both E. roumanicus and E. europaeus. Kristýna uses these filtered vcf files to run again the Admixture analysis. She finds that the small introgression signal from E. concolor into (now two) individuals of E. roumanicus did not disappear.

2018-08-09

Here, I check the effect of PCR duplicates. I count the number of second reads that map in the same place, in all libraries; I look at the allele count balance in individual libraries, and check Hardy-Weinberg statistics The conclusion is that PCR duplicates are actually removing heterozygous genotypes from the populations. In all, 16% sites show excess of homozygosity in at least one population. However, allele frequencies should not be biased.

2018-08-01

With the new filtering strategy, we find that one E. roumanicus individual shows evidence of a small contribution by E. concolor. Introgression from E. concolor is extremely unlikely, and the signal may be due to residual incomplete lineage sorting (Barbora). Here, I look at what sites are giving that signal, and make sure it is not an artifact. I confirm that there are a few thousand sites, scattered across contigs where the presumably admixed individual is heterozygous while the rest of E. roumanicus and E. concolor specimens are (almost) fixed for alternative alleles. I also notice that many of those sites do not include data for all 5 E. concolor individuals, and I hypthesize that the lack of accuracy in allele frequency estimation there is contributing to the signal.

2018-07-25

On 2018-07-24, Kristyna repeated the filtering of the original vcf file. The new filtered vcf file should be the common starting point of all downstream analysis. Here, I just add the BPF label to the INFO field of that vcf file.

2018-07-19

When trying to compare the SNPs identified as having europaeus ancestry in the hybrid individual with the sites that show signal of historic introgression between E. roumanicus and E. europaeus, Kristyna realized that very few of those SNPs were included in the abba/baba analysis. In this folder, I find out that most of those sites were arbitrarily excluded from the abba/baba test because they were too close to other SNPs.

2018-06-30

I wrote the script AncestryProfile.py, which takes the output files from Admixture and uses an empirical Bayes approach to estimate the posterior probability of each locus to have at least one allele coming from each ancestry group. I run this script on all individuals, although I am only interested in the hybrid between E. europaeus and E. romanicus. The script seems to work well. Most of the genome show levels of ancestry probability equal to the prior, which is the genome-wide contribution estimated by Admixture. But highly informative SNPs produce spikes in the profile.

2018-04-13

I run ipyrad with the whole dataset. The results are available, but not further used, yet.

2018-03-27b

Preparation of a vcf file for abba/baba test. I want to optimize the selection of sites so that there is enough data to perform the abba/baba test. The strategy is not to rely on the general filters available in vcftools, but to select sites with data for: the Hemichinus sample (required for the test), and for any number of individuals from the other three populations.

The script freq2.awk takes a vcf file with the "binary presence flag" field in the INFO section (see 2017-05-25) as input and produces a table of allele frequencies in the requested populations for abba/baba analysis.

The filtering of the vcf file performed in this folder is superseded by a common filtering strategy applied to the original vcf for all downstream analyses. Thus, the thin.recode.vcf and all .tsv results were erased from this folder. See 2018-07-25.

2018-03-27

We received more sequence data, to improve the coverage. I should re-run the alignments and ipyrad analysis. Kristýna has already updated a vcf file. Here I just want to compare the new vcf file to the old one, to evaluate how much more complete the data matrix is after sequencing more.

2017-06-09

Here I want to prepare the analysis with the multispecies coalescent model, implemented in RevBayes. I need to format the data from the ipyrad analysis and find out the most appropriate base substitution models for each locus.

This analysis is interrupted because apparently RevBayes does not support a too long list of loci. It is a good idea to filter loci, anyways, because a Bayesian analysis will last forever if I use too many of them.

2017-05-25

The current bam files were created by Kristýna with bowtie. They miss read group information, but they are sorted by individual. I should add the RG labels and run the freebayes pipeline to compare the results from this bam file with those obtained before with a bowtie2-based alignment (2016-11-18).

The results confirm that the data matrix is quite incomplete, to make any analysis based on allele frequencies difficult. In this folder I add a binary flag to the information field of the vcf file to indicate in what samples each allele is present. This makes it easy to filter the vcf file by sites covered, for example, in at least 5 individuals from E. romanicus and 5 from E. europaeus.

2017-05-18

Running ipyrad on the demultiplexed fastq files, using the reference and the de novo assembly of reads that do not map to the reference. ipyrad uses the bwa aligner. Apparently, it mapped 129,659,615 reads (40.5%) and used 8,667,728 (2.7%) further reads with the de novo assembly pipeline.

After requesting the fixing of a bug, the output included the vcf file.

These results were not used, and superseded, anyways, by 2018-04-13. I erase them.

2017-05-17

Counting total numbers of raw and mapped reads, per sample. Overall the two sequencing runs produced 319,934,047 reads (on average 6,398,680 reads per sample, with a range of 717,180 -- 14,805,528). In all, 163290876 (51%) reads were mapped with bowtie.

2017-05-09

Testing different mapping options to determine if we can improve the mapping success and the coverage. I confirm that bowtie2 is more sensitive, and that higher sensitivity leads to a larger proportion of multiple hits (ambiguous mapping), and a decrease in unmapped reads, to a lesser extent. Trimming the low quality tail of reads decreases the uniqueness of their mappings with bowtie2. Unfortunately, bowtie1 does not report the numbers of multiply mapped reads with sensitive settings. The best compromise between a large portion of unique mappings and a low proportion of unmapped reads is achieved using bowtie2 with whole reads and 'conservative' settings: 47.22% unique mappings, 18.46% unmapped.

To check the possible reasons for the unmapped reads, I manually did the following (neither the commands nor the results are saved):

  • Extract fastq files for mapped and unmapped reads from one bam file (Er71).
  • Run fastqc on both to compare the qualities.
  • Filter unmapped fastq records by maximum number of expected errors (0.5).
  • BLAST a random subsample of 50 unmapped, high-quality reads to the nr database.

I conclude that the quality of unmapped reads is just slighly lower towards the end of the reads. About 50% of the unmapped reads actually hit one or more hedgehog sequences (either Erinaceus europeus, as expected, or Atelerix albiventris, probably due to database limitations). Most such reads find several hits, suggesting that they were unmapped (with bowtie1) due to ambiguous mapping. Only one of 50 sequences mapped to a totally different organism: a nematode (a parasite?). Contamination is, at most, rare.

2016-11-30

Using bedtools, I compare the bam files among them, and find the number of sites covered by at least 3 reads in N samples. I see, for example that 157661 sites are covered with at least 3 reads in at least 40 samples, which seems to be plenty of information. Let's say that at least 60 bases of every read are good enough to detect a variant. If heterozygosity was 0.001, I would expect less than 10000 variants per individual. I suspect Freebayes is not doing a very good job at identifying variants.

2016-11-24

The same conclusions than before are confirmed when including combined variants in some statistics. The big picture is one of large genetic distance between populations, and little variation within a population. This seems to be the reason why little genetic structure is observed.

The question arises if the little genetic variation observed within populations is due to a waste of reads in loci that don't get covered in all samples, or if loci are quite homogeneously covered but turn out to be actually invariant in many cases.

To answer this question, both variant and non-variant sites should have been outputed by freebayes. In any case, the decay of the number of sites covered in X samples when X increases suggests that the problem is one of too many loci for not so many reads.

2016-11-18

Using freebayes to call variants, and vcftools to summarize them, we notice:

  1. Several types of variants, including SNP, MNP, complex variants, and multiple combinations thereof. The current freebayes settings seem to have excluded indels.
  2. The vast majority of variable sites are fixed variants among populations, rather than variation segregating within populations.

About

Population genomics of european hedgehogs across contact zones.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages