NanoExpansion: a tool for the characterization of Repeat Expansion Patterns in Nanopore sequencing samples
NanoExpansion is a Python tool for extracting and characterizing Short Tandem Repeat (STR) data from nanopore sequencing. It uses the output of straglr to plot the expansion site of a region of interest (e.g. the DMPK gene for DM1) and to return the compact expansion pattern string. It reuses some ideas from EPI2ME wf-human-variation and implements a recursive search for motifs of interest, which the user specifies based on prior biological knowledge.
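
To give an idea of what a compact expansion pattern string is, here is a minimal sketch that collapses consecutive copies of a main motif and an interruption motif into the `(MOTIF)count` notation. It is not NanoExpansion's implementation: it uses a regular expression instead of the recursive search, the function name `compact_pattern` is hypothetical, and it applies no correction thresholds (`ins1`, `ins2`).

```python
import re


def compact_pattern(seq, motif, interruption):
    """Collapse runs of `motif` and `interruption` into e.g. (CTG)3(CTC)2.

    Illustrative sketch only. Bases that belong to neither motif are
    skipped silently, and two runs of the same motif separated by such
    bases are reported as two separate blocks.
    """
    run = re.compile(
        f"((?:{re.escape(motif)})+)|((?:{re.escape(interruption)})+)"
    )
    blocks = []
    for match in run.finditer(seq):
        if match.group(1):  # a run of the main motif
            blocks.append(f"({motif}){len(match.group(1)) // len(motif)}")
        else:  # a run of the interruption motif
            count = len(match.group(2)) // len(interruption)
            blocks.append(f"({interruption}){count}")
    return "".join(blocks)


print(compact_pattern("CTGCTGCTGCTCCTCCTG", "CTG", "CTC"))
# (CTG)3(CTC)2(CTG)1
```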
Some files are needed in order to run NanoExpansion:
- a sorted and indexed .bam file of the sample of interest
- .tsv, .bed and .vcf output files from straglr
- the catalogue of STR annotation for Stranger
- a .bed file with the region and the motif of expansion
Moreover, the folder structure must be the following:
sample/
│
└── nanoexpansion/
    ├── <sample>-straglr_old.tsv
    ├── <sample>-straglr_old.bed
    ├── <sample>-straglr_old.vcf
    ├── <sample>_roi.bam
    ├── <sample>_roi.bam.bai
    ├── variant_catalog_hg38.json
    ├── <gene>_filter.bed
    └── str_repeats.bed
All the required files must be inside the nanoexpansion folder.
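
Before launching the pipeline it can be handy to verify that layout. The following is a small sketch, assuming the file names listed above; the helper names `expected_inputs` and `missing_inputs` are hypothetical and not part of NanoExpansion.

```python
from pathlib import Path


def expected_inputs(sample, gene):
    """File names from the folder layout above, for one sample and gene."""
    return [
        f"{sample}-straglr_old.tsv",
        f"{sample}-straglr_old.bed",
        f"{sample}-straglr_old.vcf",
        f"{sample}_roi.bam",
        f"{sample}_roi.bam.bai",
        "variant_catalog_hg38.json",
        f"{gene}_filter.bed",
        "str_repeats.bed",
    ]


def missing_inputs(sample_dir, sample, gene):
    """Return the expected files not found in <sample_dir>/nanoexpansion."""
    folder = Path(sample_dir) / "nanoexpansion"
    return [
        name
        for name in expected_inputs(sample, gene)
        if not (folder / name).is_file()
    ]
```

An empty list from `missing_inputs(...)` means every file in the layout is present. Note that the `_roi.bam` files only exist after the samtools step has been run.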
Depending on the straglr version used, you may need to transform the output .tsv file so that it contains only the following columns:
'chrom', 'start', 'end', 'repeat_unit', 'genotype', 'read', 'copy_number', 'size', 'read_start', 'strand', 'allele'
If your .tsv does not satisfy this requirement, the Snakemake pipeline handles it by transforming the .tsv and .vcf files.
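
The pipeline performs this conversion on its own; the sketch below only illustrates what the column selection amounts to, in case you want to inspect a file by hand. It is not the pipeline's code. It assumes that the first line of the .tsv is the header (optionally prefixed with `#`), and the function name `keep_required_columns` is hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "chrom", "start", "end", "repeat_unit", "genotype", "read",
    "copy_number", "size", "read_start", "strand", "allele",
]


def keep_required_columns(tsv_text):
    """Return tab-separated text with only REQUIRED_COLUMNS, in that order."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Assumption: the first line is the header, possibly starting with '#'.
    header = [name.lstrip("#") for name in next(reader)]
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    positions = [header.index(col) for col in REQUIRED_COLUMNS]

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(REQUIRED_COLUMNS)
    for row in reader:
        if row:  # skip blank lines
            writer.writerow([row[i] for i in positions])
    return out.getvalue()
```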
- Download the repository
git clone https://github.com/Cesco16/NanoExpansion.git
cd NanoExpansion
- Create and activate the conda environment
conda env create -f requirements.yaml
conda activate nanoexpansion
- Keep only the reads overlapping the STR of interest, then index the resulting .bam file
samtools view -b -h -o <sample>_roi.bam -L <gene>_filter.bed <sample>_sort.bam
samtools index <sample>_roi.bam
- Run the Snakemake pipeline
snakemake --cores 4 --config sample=<sample> motif='CAG' interruption='CGG' ins1=2 ins2=1 gene="DMPK" disease="DM1"
| Option | Description |
|---|---|
| sample <STR> | ID of the sample to process. Required. |
| motif <STR> | Main repeat motif. Default is CAG. |
| interruption <STR> | Interruption repeat motif. Default is CAA. |
| ins1 <INT> | Threshold for correction of main repeat motif. Default is 3. |
| ins2 <INT> | Threshold for correction of interruption repeat motif. Default is 1. |
| gene <STR> | Gene with the expanded motif. Default is DMPK. |
| disease <STR> | Disease corresponding to the main motif. Default is DM1. |
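
When processing many samples, the command can be assembled from a script. This is a minimal sketch that builds the argument list from the options and defaults listed in the table; the function name `snakemake_command` is hypothetical and NanoExpansion does not ship such a helper.

```python
# Defaults copied from the options table.
DEFAULTS = {
    "motif": "CAG",
    "interruption": "CAA",
    "ins1": 3,
    "ins2": 1,
    "gene": "DMPK",
    "disease": "DM1",
}


def snakemake_command(sample, cores=4, **overrides):
    """Build the snakemake argument list for one sample."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    config = {"sample": sample, **DEFAULTS, **overrides}
    args = ["snakemake", "--cores", str(cores), "--config"]
    args += [f"{key}={value}" for key, value in config.items()]
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)` from inside the repository folder, for example `snakemake_command("S1", interruption="CGG", ins1=2)`.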
Here is an example of NanoExpansion applied to a patient affected by Myotonic Dystrophy type 1 (DM1), which is characterized by an expansion of the CTG triplet in the DMPK gene. NanoExpansion makes it possible to characterize both the wild-type and the mutated allele. The numbers in the plots represent the number of nucleotides in each region; the number of repeats is obtained by dividing those numbers by the length of the repeat motif (in this case, 3).
The mutated reads can be characterized as well. Here is an example of an expanded read that shows a TTG interruption pattern:
Finally, NanoExpansion returns the complete characterization of repeat patterns in all the available reads:
4f5fb621-ed87-45c4-84f2-8d6b5794e655: (CTG)5
822e5d7b-a2c2-4290-aeb9-1759c1d65276: (CTG)4
79ea0cda-7e44-4ea1-aaac-1d943b29bdf4: (CTG)5
10b08737-492e-4c80-86ee-5f2039fd069d: (CTG)37(CTC)252(CTG)35
7b97eb27-ef56-413b-a422-ccce2abea0d3: (CTG)296(CTC)152(CTG)61
3e04399c-f889-453e-bd32-4c26e0ece28b: (CTG)558
6c779360-4d9d-444e-a8b4-fcd37c65d339: (CTG)5
bc0a2b2a-e11b-4242-aed1-80ca1a1400e8: (CTG)5
a25cb320-6040-476c-b2ef-c490ab2b599b: (CTG)5
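
For downstream analysis, these lines can be parsed back into motif blocks. The sketch below assumes the `read_id: (MOTIF)count...` layout shown above; the function names are hypothetical. It also computes the tract length in nucleotides, which is the inverse of the division described for the plots (nucleotides = repeats × motif length).

```python
import re

BLOCK = re.compile(r"\(([ACGT]+)\)(\d+)")


def parse_pattern(pattern):
    """Turn '(CTG)37(CTC)252(CTG)35' into [('CTG', 37), ('CTC', 252), ('CTG', 35)]."""
    return [(motif, int(count)) for motif, count in BLOCK.findall(pattern)]


def parse_line(line):
    """Split one output line into (read_id, blocks)."""
    read_id, pattern = line.strip().split(": ", 1)
    return read_id, parse_pattern(pattern)


def summarize(blocks):
    """Total number of repeat units and of nucleotides in a parsed pattern."""
    return {
        "repeats": sum(count for _, count in blocks),
        "nucleotides": sum(len(motif) * count for motif, count in blocks),
    }


read_id, blocks = parse_line(
    "10b08737-492e-4c80-86ee-5f2039fd069d: (CTG)37(CTC)252(CTG)35"
)
print(summarize(blocks))
# {'repeats': 324, 'nucleotides': 972}
```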
NanoExpansion can be tested using the synthetically generated reads in the benchmark folder. Each sample is named after the actual number of repeats in the DMPK gene, and the results from NanoExpansion must agree with it. NanoExpansion is expected to fail only on sample output_14_58_21_25, since its insertion pattern falls outside the main repeated pattern (CTG).
- Currently, NanoExpansion works only with the hg38 reference genome. Support for the T2T HS1 reference will be released soon.
- Always check the start and end columns in the .tsv and .bed files: they must be the start and end positions of the repeat expansion region (change them manually if needed).
- NanoExpansion can correctly detect the repeat pattern only if the interruption motif falls entirely within the main repeat motif (e.g., CTG for DM1).
- Currently, NanoExpansion works only on DM1 and ALS samples (the ones known to have interruption patterns).
This project is licensed under the MIT License.
You are free to use, modify, and distribute this software under the terms of the license.
If you use NanoExpansion in your research or work, please cite the GitHub repository:
@misc{NanoExpansion,
author = {Francesco Casadei},
title = {NanoExpansion: a tool for the characterization of Repeat Expansion Pattern in Nanopore sequencing samples},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Cesco16/NanoExpansion}}
}


