This repository contains the scripts and pipelines used in the foxtail millet (Setaria italica) pan-genome and population genomics study. It provides reproducible workflows for SV detection, SNP/indel analysis, genotype imputation evaluation, PAV-based prediction, and downstream analyses.
- `WGS/`
  - Read processing and variant calling (alignment, sorting, duplicate marking, VCF extraction).
- `SV-calling/`
  - Assembly alignment and SyRI-based structural variation calling, including Me34V (wild progenitor) comparisons.
- `SNP_indel_SV-analysis/`
  - Filtering, merging, imputation, GWAS preparation, annotation, and ML interpretation. Includes scripts for merging SV/SNP/indel datasets and downstream analyses.
- `PAV-predict/`
  - Presence/absence variant encoding and predictive modeling (LASSO, LightGBM, RandomForest, SVM) with result visualization.
- `pangenome/`
  - Scripts for building and handling the graph-based pangenome (minigraph/minimap2 workflows, Jasmine merging, SyRI preparation).
- `imputation-evaluate/`
  - Pipelines to evaluate imputation accuracy under varying masking ratios and reference sample sizes. Contains separate SNP and SV+SNP evaluation modules.
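The masking idea behind the imputation evaluation can be sketched in a few lines of Python. This is a toy illustration, not the repository's pipeline: the function names `mask_genotypes` and `concordance` are hypothetical, genotypes are held in a plain dictionary rather than a VCF, and exact genotype concordance is assumed as the accuracy measure. The scripts in `imputation-evaluate/` may use a different metric.

```python
import random


def mask_genotypes(genotypes, ratio, seed=0):
    """Hide a fraction of known genotype calls.

    genotypes: dict mapping (sample, site) -> genotype string such as "0/1".
    Returns (masked, truth). `masked` is a copy with the chosen entries set
    to "./."; `truth` holds the original values of only those entries.
    """
    rng = random.Random(seed)            # fixed seed keeps the masking reproducible
    keys = sorted(genotypes)
    n_mask = round(len(keys) * ratio)    # number of calls to hide
    hidden = rng.sample(keys, n_mask)
    masked = dict(genotypes)
    truth = {}
    for key in hidden:
        truth[key] = genotypes[key]
        masked[key] = "./."
    return masked, truth


def concordance(imputed, truth):
    """Fraction of masked calls whose imputed genotype matches the truth."""
    if not truth:
        return float("nan")

    def norm(gt):
        # Ignore phasing and allele order: "1|0" and "0/1" compare equal.
        return tuple(sorted(gt.replace("|", "/").split("/")))

    hits = sum(norm(imputed[key]) == norm(value) for key, value in truth.items())
    return hits / len(truth)
```

In a real run, the masked genotypes would be written out, imputed with the chosen tool, and read back before calling `concordance`. Repeating this over several values of `ratio` and several reference panel sizes gives the accuracy curves that the evaluation modules are meant to produce.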
Code organization
- Scripts are organized by numeric prefixes to indicate execution order where applicable (e.g., `001*`, `002*`). Languages used: Bash, Python, R, and Jupyter Notebooks.
- Install required tools: `bcftools`, `samtools`, `minimap2`, `minigraph`, `jasmine`, `syri`, `beagle`, `plink`, R (with required packages), Python (with dependencies).
- Review the script headers for per-script parameters and required input paths.
- Run the pipelines in directory order, or follow the numeric script prefixes, to reproduce the results step by step.
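To see the execution order implied by the numeric prefixes before running anything, a small helper along these lines may be useful. It is a sketch under assumptions: `ordered_scripts` and `print_plan` are hypothetical names, and it only handles files whose names begin with digits. It lists scripts and does not execute them, since each script may need its own parameters and input paths.

```python
import re
from pathlib import Path

# Leading run of digits in a file name, e.g. "001" in "001_align.sh".
PREFIX = re.compile(r"^(\d+)")


def ordered_scripts(directory):
    """Return files whose names start with digits, sorted by that number."""
    found = []
    for path in Path(directory).iterdir():
        match = PREFIX.match(path.name)
        if path.is_file() and match:
            # Sort numerically so that "10" comes after "9", not after "1".
            found.append((int(match.group(1)), path.name, path))
    return [path for _, _, path in sorted(found)]


def print_plan(directory):
    """Print the script names in the order they would be run."""
    for path in ordered_scripts(directory):
        print(path.name)
```

Calling `print_plan("WGS")` from the repository root would print the prefixed scripts in that directory in ascending order. Files without a numeric prefix are skipped.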
- `5.pan.biallelic.filtered.realigned.vcf.gz`: Finalized, biallelic SV VCF used as the primary SV reference in analyses.
- `pangenome.gfa`: Graph-based pangenome representation for Setaria italica.
- `Me34V_scaffold_7:26.84Mb-26.94Mb.zip` (or packaged `.tar.gz`): SyRI results for Me34V vs. target assemblies in the specified interval.
- Many scripts assume a POSIX environment and standard bioinformatics tools in `$PATH`.
- Wherever possible, set absolute paths in wrapper scripts and use conda environments or Docker for reproducible runs.
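Because the scripts expect their tools on `$PATH`, a quick pre-flight check can save a failed run. The sketch below uses Python's standard `shutil.which`. The executable names in `TOOLS` are assumptions taken from the tool list in this README: some installations expose different names (for example, Beagle is often distributed as a `.jar` and PLINK 2 as `plink2`), so adjust the list to match your environment.

```python
import shutil

# Assumed executable names; edit to match how the tools are installed locally.
TOOLS = [
    "bcftools", "samtools", "minimap2", "minigraph",
    "jasmine", "syri", "beagle", "plink", "Rscript",
]


def missing_tools(tools=TOOLS):
    """Return the tools from `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def report(tools=TOOLS):
    """Print a short summary of which tools are missing, if any."""
    missing = missing_tools(tools)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Running `report()` inside the activated conda environment or Docker container checks the same `$PATH` the pipeline scripts will see.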
If you use these scripts, please cite the associated publication describing the foxtail millet pangenome and analyses.