Cauwth/CodeForMillet

CodeForMillet

This repository contains the scripts and pipelines used in the foxtail millet (Setaria italica) pan-genome and population genomics study. It provides reproducible workflows for SV detection, SNP/indel analysis, genotype imputation evaluation, PAV-based prediction, and downstream analyses.

Directory overview

  • WGS/

    • Read processing and variant calling (alignment, sorting, duplicate marking, VCF extraction).
  • SV-calling/

    • Assembly alignment and SyRI-based structural variation calling, including Me34V (wild progenitor) comparisons.
  • SNP_indel_SV-analysis/

    • Filtering, merging, imputation, GWAS preparation, annotation, and ML interpretation. Includes scripts for merging SV/SNP/indel datasets and downstream analyses.
  • PAV-predict/

    • Presence/absence variant encoding and predictive modeling (LASSO, LightGBM, RandomForest, SVM) with result visualization.
  • pangenome/

    • Scripts for building and handling the graph-based pangenome (minigraph/minimap2 workflows, Jasmine merging, SyRI preparation).
  • imputation-evaluate/

    • Pipelines to evaluate imputation accuracy under varying masking ratios and reference sample sizes. Contains separate SNP and SV+SNP evaluation modules.

Code organization

  • Scripts are organized by numeric prefixes to indicate execution order where applicable (e.g., 001*, 002*). Languages and formats used: Bash, Python, R, and Jupyter notebooks.
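
The numeric-prefix convention can be turned into an explicit execution order. The following is a minimal sketch, not part of the repository: the helper names (`numeric_prefix`, `ordered_scripts`, `list_pipeline`) and the example file names are hypothetical, and it assumes the prefix is a leading run of digits.

```python
import re
from pathlib import Path

# Matches a leading run of digits, e.g. "001" in a name like "001_align.sh".
_PREFIX = re.compile(r"^(\d+)")


def numeric_prefix(name):
    """Return the leading integer of a file name, or None if it has none."""
    match = _PREFIX.match(name)
    return int(match.group(1)) if match else None


def ordered_scripts(names):
    """Sort names that carry a numeric prefix by that prefix; drop the rest."""
    prefixed = [n for n in names if numeric_prefix(n) is not None]
    # Ties on the prefix are broken alphabetically so the order is stable.
    return sorted(prefixed, key=lambda n: (numeric_prefix(n), n))


def list_pipeline(directory):
    """List the prefixed files of one directory in execution order."""
    return ordered_scripts(p.name for p in Path(directory).iterdir() if p.is_file())
```

For example, `list_pipeline("WGS")` would list the prefixed files in WGS/ in ascending prefix order. Files without a numeric prefix are skipped, so check those against the script headers by hand.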

Quick usage

  1. Install required tools: bcftools, samtools, minimap2, minigraph, jasmine, syri, beagle, plink, R (with required packages), Python (with dependencies).
  2. Review the script headers for per-script parameters and required input paths.
  3. Run the pipelines in directory order, or follow the numerically prefixed scripts within each directory to reproduce the results step by step.
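
Before running anything, it can help to confirm that the tools from step 1 are reachable. The sketch below is not part of the repository. The executable names are assumptions derived from the tool list and may differ on your system (for instance, Beagle is often distributed as a Java .jar rather than a `beagle` command, and the R and Python interpreters are checked here as `Rscript` and `python3`).

```python
import shutil

# Executable names are assumptions based on the tool list above; adjust them
# to match your installation (e.g. Beagle may be a .jar run through java).
REQUIRED_TOOLS = [
    "bcftools", "samtools", "minimap2", "minigraph", "jasmine",
    "syri", "beagle", "plink", "Rscript", "python3",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that an executable of that name exists. It does not verify versions, R packages, or Python dependencies.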

File naming and core outputs

  • 5.pan.biallelic.filtered.realigned.vcf.gz — Finalized, biallelic SV VCF used as the primary SV reference in analyses.
  • pangenome.gfa — Graph-based pangenome representation for Setaria italica.
  • Me34V_scaffold_7:26.84Mb-26.94Mb.zip (or packaged .tar.gz) — SyRI results for Me34V vs. target assemblies in the specified interval.
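
As a quick sanity check on the SV VCF, one can confirm that no record lists more than one ALT allele. This is a minimal standard-library sketch, not a script from the repository: the function names are hypothetical, and it relies only on the VCF convention that ALT is the fifth tab-separated column and that multiple alleles are comma-separated.

```python
import gzip


def open_text(path):
    """Open a plain or gzip/bgzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def multiallelic_records(path):
    """Yield (chrom, pos, alt) for records whose ALT lists several alleles."""
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip meta-information, header, and blank lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos, alt = fields[0], int(fields[1]), fields[4]
            if "," in alt:  # VCF separates multiple ALT alleles with commas
                yield chrom, pos, alt
```

Running `list(multiallelic_records("5.pan.biallelic.filtered.realigned.vcf.gz"))` should return an empty list if the file is strictly biallelic. For large files, a dedicated tool such as bcftools is the faster option.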

Reproducibility notes

  • Many scripts assume a POSIX environment and standard bioinformatics tools in $PATH.
  • Wherever possible, set absolute paths in wrapper scripts and use conda environments or Docker for reproducible runs.

Citation

If you use these scripts, please cite the associated publication describing the foxtail millet pangenome and analyses.

