Skip to content

Latest commit

 

History

History
224 lines (149 loc) · 6.21 KB

TraceAncestor.rst

File metadata and controls

224 lines (149 loc) · 6.21 KB

Chromosome painting using TraceAncestor

TraceAncestor is a suite of script that allows to estimate the allelic dosage of ancestral alleles in hybrid individuals and then to perform chromosome painting.

Installation

git clone https://github.com/gdroc/GeMo_tutorials.git
cd GeMo_tutorials

Download dataset, you only need to launch the script download_dataset.pl without any parameter

perl download_dataset.pl

This script create a new directory data

data/
├── Ahmed_et_al_2019_color.txt
├── Ahmed_et_al_2019_individuals.txt
├── Ahmed_et_al_2019_origin.txt
├── Ahmed_et_al_2019.vcf

Workflow

vcf2gst.pl

Usage

This script is used to define GST values from individuals that are identified as pure breed for an ancestor.

Must be used on pure breed. If there is introgressed part on the genome of the individual, the part must be removed before analysis.

bin/vcf2gst.pl --help

Parameters :
   --vcf       vcf containing the ancestors and other individuals to scan [Required]
   --ancestor  A two column file with individuals in the first column and group tag (i.e. origin) in the second column [Required]
   --depth     minimal depth for a snp to be used in the analysis (Default 5)
   --output    output file name (Default GSTmatrix.txt)
   --help

Input

--ancestor Ancestor file (Required)

A two column file with individuals in the first column and group tag (i.e. origin) in the second column

individuals origin
De_Chios Mandarin
Shekwasha Mandarin
Sunki Mandarin
Cleopatra Mandarin
Pink Pummello
Timor Pummello
Tahitian Pummello
Deep_red Pummello
Corsican Citron
Buddha_Hand Citron
--vcf VCF file (Required)

Now, you can run the following command

perl bin/vcf2gst.pl --ancestor data/Ahmed_et_al_2019_origin.txt --vcf data/Ahmed_et_al_2019.vcf --output GSTMatrix.txt

Output

The output is a CSV file containing GST (inter-population differentiation parameter) information:

with :

  • #CHROM = chromosome name
  • POS = position of DSNP
  • REF = Base of the reference allele of this DSNP
  • ALT = Base of the alternative allele of this DSNP
  • %Nref = Percentage of maximal missing data for this DSNP
  • GST = value of GST (inter-population differentiation parameter) (With 1,2,3 the ancestors names)
  • F = Alternative allele frequency for each ancestor (With 1,2,3 the ancestors names)

prefilter.pl

Usage

This script is used to define a matrix of ancestry informative markers from the matrix gotten at the step 1.

bin/prefilter.pl --help
Parameters :
    --matrix    GST matrix [Required]
    --gst       threshold for gst (Default : 0.9)
    --missing   threshold for missing data (Default 0.3)
    --output    output file name (Default Diagnosis_matrix)
    --help      display this help

Now, you can run the following command

perl bin/prefilter.pl --matrix GSTMatrix.txt --output Diagnosis_matrix.txt

Output

A matrix containing all the ancestry informative markers for every ancestors.

with:

  • ancestor = Ancestor names
  • chromosome = Chromosome numbers
  • position = Position of the SNP marker
  • allele = Base of the ancestral allele

TraceAncestor.pl

Usage

bin/TraceAncestor.pl --help

Parameters :
    --matrix     Diagnosis matrix [Required]
    --vcf       vcf of the hybrid population
    --individuals    A two column file with individuals to scan for origin (same as defined in the VCF headerline) in the first column and the ploidy in the second column [Required]
    --window    number of markers by window (Default 10)
    --lod       LOD value to conclude for one hypothesis (Default 3)
    --freq      theoretical frequency used to calcul the LOD (Default 0.99)
    --cut       number of K bases in one window (Default 100)
    --dirout    Directory output (Default result)
    --help      display this help

Input

--individuals A two column file with individuals to scan for origin (same as defined in the VCF headerline) in the first column and the ploidy in the second column.

Now, you can run the following command

perl bin/TraceAncestor.pl --matrix Diagnosis_matrix.txt --vcf data/Ahmed_et_al_2019.vcf --individuals data/Ahmed_et_al_2019_individuals.txt

Output

For each individual present in the file data/Ahmed_et_al_2019_individuals.txt, 4 outputs are produced, prefixed with the name of indivual :

  • Bergamot_ideo.txt : A text file of the position of genomic blocks the ancestry mosaic with a succession of genomic blocks along the chromosome
chr haplotype start end ancestral_group
1 0 1 28700000 Citron
1 1 1 28700000 Pummello
2 0 1 600000 Citron
2 0 3000001 4200000 Mandarin
2 0 4200001 10400000 Citron
2 0 10800001 35200000 Citron
  • Bergamot_chrom.txt : A tab file with name, length and karyotype based on ploidy.
  • Bergamot_ancestor.txt : Frequency of ancestors alleles along chromosome for the particular hybrid focused.
  • Bergamot_curve.txt : Frequency of ancestors alleles along chromosome for the GeMo visualization tool.

Visualization and block refinement with GeMo

Go to GeMo WebApp

  • Load data has follow
Gemo_Vizualise

References