Cesco16/NanoExpansion

NanoExpansion: a tool for the characterization of Repeat Expansion Patterns in Nanopore sequencing samples

NanoExpansion is a Python tool for extracting and characterizing Short Tandem Repeat (STR) data from nanopore sequencing. It uses the output of straglr to plot the expansion site in the region of interest (e.g. gene DMPK for DM1) and to return the compact expansion pattern string. It reuses some ideas from EPI2ME wf-human-variation and implements a recursive search for motifs of interest, which the user specifies based on prior biological knowledge.
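To make the idea of a compact expansion pattern string concrete, here is a minimal sketch that run-length encodes a repeat tract over a user-supplied list of motifs. It is a simplified, iterative greedy scan written for illustration only; it is not NanoExpansion's recursive search, and the function name `compact_pattern` is hypothetical.

```python
def compact_pattern(sequence: str, motifs: list[str]) -> str:
    """Run-length encode a repeat tract, e.g. '(CTG)3(CTC)2(CTG)1'.

    Illustrative only: motifs are tried in the given order at each position,
    and the scan stops at the first position where no motif matches.
    """
    runs: list[list] = []  # each entry is [motif, consecutive_count]
    pos = 0
    while pos < len(sequence):
        for motif in motifs:
            if sequence.startswith(motif, pos):
                if runs and runs[-1][0] == motif:
                    runs[-1][1] += 1          # extend the current run
                else:
                    runs.append([motif, 1])   # start a new run
                pos += len(motif)
                break
        else:
            # No known motif starts here: stop scanning (assumed behavior).
            break
    return "".join(f"({motif}){count}" for motif, count in runs)


print(compact_pattern("CTGCTGCTGCTCCTCCTG", ["CTG", "CTC"]))
```

A real tract contains sequencing errors, which is presumably why the tool exposes correction thresholds; this sketch does not attempt any correction.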

Requirements

Some files are needed in order to run NanoExpansion:

  • a sorted and indexed .bam file of the sample of interest
  • .tsv, .bed and .vcf output files from straglr
  • the STR annotation catalogue for Stranger
  • a .bed file with the region and the motif of expansion

The folder structure must be as follows:

sample/
│
└── nanoexpansion/
    ├── <sample>-straglr_old.tsv
    ├── <sample>-straglr_old.bed
    ├── <sample>-straglr_old.vcf
    ├── <sample>_roi.bam
    ├── <sample>_roi.bam.bai
    ├── variant_catalog_hg38.json
    ├── <gene>_filter.bed
    └── str_repeats.bed

All required files must be placed inside the nanoexpansion folder.
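Before launching the pipeline, it can help to verify that the layout above is complete. The following is a small sketch, not part of NanoExpansion; the helper names `expected_files` and `missing_files` are hypothetical, and the file names are taken directly from the tree above.

```python
from pathlib import Path


def expected_files(sample: str, gene: str) -> list[str]:
    """File names listed in the folder structure above."""
    return [
        f"{sample}-straglr_old.tsv",
        f"{sample}-straglr_old.bed",
        f"{sample}-straglr_old.vcf",
        f"{sample}_roi.bam",
        f"{sample}_roi.bam.bai",
        "variant_catalog_hg38.json",
        f"{gene}_filter.bed",
        "str_repeats.bed",
    ]


def missing_files(sample_dir: str, sample: str, gene: str) -> list[str]:
    """Return the expected files not found in <sample_dir>/nanoexpansion/."""
    folder = Path(sample_dir) / "nanoexpansion"
    return [name for name in expected_files(sample, gene)
            if not (folder / name).is_file()]
```

An empty list means every expected file is present. Note that the `_roi.bam` and `_roi.bam.bai` files only exist after the filtering and indexing step described under "Run NanoExpansion".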

Depending on the straglr version used, the output .tsv file may need to be reduced so that it contains only the following columns:

'chrom', 'start', 'end', 'repeat_unit', 'genotype', 'read', 'copy_number', 'size', 'read_start', 'strand', 'allele'

If your .tsv does not meet this requirement, the Snakemake pipeline handles it by transforming the .tsv and .vcf files.
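For readers who want to inspect or prepare the .tsv by hand, the sketch below checks for the required columns and drops everything else. It is an illustration under assumptions, not the transformation the pipeline performs: it assumes a single tab-separated header line whose names match the list above exactly (no leading `#` and no comment lines), and `select_required_columns` is a hypothetical name.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "chrom", "start", "end", "repeat_unit", "genotype", "read",
    "copy_number", "size", "read_start", "strand", "allele",
]


def select_required_columns(tsv_text: str) -> str:
    """Keep only the required columns, in the order listed above."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=REQUIRED_COLUMNS,
        delimiter="\t",
        extrasaction="ignore",  # silently drop columns that are not required
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()
```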

Run NanoExpansion

  1. Download the repository

    git clone https://github.com/Cesco16/NanoExpansion.git
    cd NanoExpansion
  2. Create and activate the conda environment

    conda env create -f requirements.yaml
    conda activate nanoexpansion
  3. Keep only the reads overlapping the STR of interest, then index the resulting .bam file

    samtools view -b -h -o <sample>_roi.bam -L <gene>_filter.bed <sample>_sort.bam
    samtools index <sample>_roi.bam
  4. Run the Snakemake pipeline

    snakemake --cores 4 --config sample=<sample> motif='CAG' interruption='CGG' ins1=2 ins2=1 gene="DMPK" disease="DM1"

Options

Option                Description
sample <STR>          ID of the sample to process. Required.
motif <STR>           Main repeat motif. Default is CAG.
interruption <STR>    Interruption repeat motif. Default is CAA.
ins1 <INT>            Threshold for correction of main repeat motif. Default is 3.
ins2 <INT>            Threshold for correction of interruption repeat motif. Default is 1.
gene <STR>            Gene with the expanded motif. Default is DMPK.
disease <STR>         Disease corresponding to the main motif. Default is DM1.
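When running many samples, the command line can be assembled programmatically. The sketch below builds the same `snakemake --cores N --config key=value ...` invocation shown above and rejects option names that are not in the table. The helper `snakemake_command` is hypothetical and not part of NanoExpansion; options that are not passed are left to the pipeline's own defaults.

```python
KNOWN_OPTIONS = ("motif", "interruption", "ins1", "ins2", "gene", "disease")


def snakemake_command(sample: str, cores: int = 4, **options) -> list[str]:
    """Build the argument list for the Snakemake call documented above."""
    unknown = sorted(set(options) - set(KNOWN_OPTIONS))
    if unknown:
        raise ValueError(f"unknown options: {unknown}")
    args = ["snakemake", "--cores", str(cores), "--config", f"sample={sample}"]
    # Emit options in the same order as the table, skipping unset ones.
    args += [f"{key}={options[key]}" for key in KNOWN_OPTIONS if key in options]
    return args


print(" ".join(snakemake_command("sample01", motif="CAG", interruption="CGG",
                                 ins1=2, ins2=1, gene="DMPK", disease="DM1")))
```

The returned list can be handed to `subprocess.run(args, check=True)` from inside the cloned repository with the conda environment active.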

Example of usage

Here is an example of NanoExpansion applied to a patient affected by Myotonic Dystrophy type 1 (DM1), which is characterized by an expansion of the CTG triplet in gene DMPK. NanoExpansion makes it possible to characterize both the wild-type and the mutated allele. The numbers in the plots represent the number of nucleotides in each region. The number of repeats is obtained by dividing those numbers by the length of the repeat motif (in this case, 3).
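The conversion from nucleotides to repeats described above is a single division. As a small worked sketch (the function name `repeats_from_length` is hypothetical), `divmod` returns the number of whole repeat units together with any leftover nucleotides:

```python
def repeats_from_length(length_nt: int, motif: str) -> tuple[int, int]:
    """Return (whole repeat units, leftover nucleotides) for a region."""
    if not motif:
        raise ValueError("motif must not be empty")
    return divmod(length_nt, len(motif))


# A 15-nucleotide CTG region corresponds to 5 repeats with no remainder.
print(repeats_from_length(15, "CTG"))
```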

Example of wild-type allele in gene DMPK

The mutated reads can be characterized in the same way. Here is an example of an expanded read that shows a TTG interruption pattern:

Example of STR with interruption pattern in gene DMPK

Finally, NanoExpansion returns the complete characterization of repeat patterns in all the available reads:

4f5fb621-ed87-45c4-84f2-8d6b5794e655: (CTG)5
822e5d7b-a2c2-4290-aeb9-1759c1d65276: (CTG)4
79ea0cda-7e44-4ea1-aaac-1d943b29bdf4: (CTG)5
10b08737-492e-4c80-86ee-5f2039fd069d: (CTG)37(CTC)252(CTG)35
7b97eb27-ef56-413b-a422-ccce2abea0d3: (CTG)296(CTC)152(CTG)61
3e04399c-f889-453e-bd32-4c26e0ece28b: (CTG)558
6c779360-4d9d-444e-a8b4-fcd37c65d339: (CTG)5
bc0a2b2a-e11b-4242-aed1-80ca1a1400e8: (CTG)5
a25cb320-6040-476c-b2ef-c490ab2b599b: (CTG)5
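The per-read lines above are easy to post-process. The following sketch parses one `read_id: (MOTIF)count...` line into a list of (motif, count) blocks and converts it back to a tract length in nucleotides. It assumes the exact text format shown above; the names `parse_pattern_line` and `total_nucleotides` are hypothetical and not part of NanoExpansion.

```python
import re

# One block of the compact pattern, e.g. "(CTG)37".
BLOCK = re.compile(r"\(([ACGT]+)\)(\d+)")


def parse_pattern_line(line: str) -> tuple[str, list[tuple[str, int]]]:
    """Split 'read_id: (CTG)37(CTC)252' into the read ID and its blocks."""
    read_id, _, pattern = line.partition(":")
    blocks = [(motif, int(count)) for motif, count in BLOCK.findall(pattern)]
    return read_id.strip(), blocks


def total_nucleotides(blocks: list[tuple[str, int]]) -> int:
    """Tract length in nucleotides implied by the blocks."""
    return sum(len(motif) * count for motif, count in blocks)


read_id, blocks = parse_pattern_line(
    "10b08737-492e-4c80-86ee-5f2039fd069d: (CTG)37(CTC)252(CTG)35"
)
print(read_id, blocks, total_nucleotides(blocks))
```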

Benchmark

NanoExpansion can be tested using the synthetically generated reads in the benchmark folder. Each sample is named after the actual number of repeats in gene DMPK, and the results from NanoExpansion should agree with those values. NanoExpansion is expected to fail only on sample output_14_58_21_25, because its insertion pattern falls outside the main repeated pattern (CTG).

Limitations

  • Currently, NanoExpansion works only with the hg38 reference genome. Support for the T2T HS1 reference will be released soon.
  • Always check the start and end columns in the .tsv and .bed files: they must be the start and end positions of the repeat expansion region (change them manually if needed).
  • NanoExpansion can correctly detect the repeat pattern only if the interruption motif falls entirely within the main repeat motif (e.g., CTG for DM1).
  • Currently, NanoExpansion works only on DM1 and ALS samples (which are the ones known to have interruption patterns).

License

This project is licensed under the MIT License.
You are free to use, modify, and distribute this software under the terms of the license.

Citation

If you use NanoExpansion in your research or work, please cite the GitHub repository:

@misc{NanoExpansion,
author = {Francesco Casadei},
title = {NanoExpansion: a tool for the characterization of Repeat Expansion Pattern in Nanopore sequencing samples},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Cesco16/NanoExpansion}}
}
