Tutorial: Analyzing AAV Data

Last Updated: 05/03/2024

NOTE: this repo is being deprecated. Please go to FormBio's laava for the latest maintained open source code and wiki. Thanks!

Portions of this wiki are still being kept here for historical purpose, however they could be deleted at any time

Example Datasets, scAAV and ssAAV

NOTE 1: if you are using SMRTLink v11 and later, please use the "AAV" mode in Run Design which will run recalladapter --> CCS with heteroduplex detection on the instrument, which will save you tons of computing time!

NOTE 2: If you want to understand more how the AAV bfx + report in this GitHub works, check out the AAV bioinformatics webinar I did! Note however the nomenclature has slightly changed and is described in the section AAV classification.

(optional) 0A. Recall + CCS from CLR (subread) data, if using SMRTLink prior to v11
(optional) 0B. Demultiplexing samples
1. Preparing the genome and annotation file
2. Mapping reads to AAV Genome
3. Summarizing and Plotting Alignment Results
4. AAV Classification: ssAAV vs scAAV, full vs partials vs backbone, and more

0A. Recall Adapter and CCS re-run if using CLR (subread) data

recalladapters	Recalling adapter from CLR data
Required for	scAAV (or if there's a mixture of scAAV with ssAAV)
Input	CLR data (`<movie>.subreads.bam,.bam.pbi` and `<movie>.scraps.bam,.bam.pbi` and `<movie>.subreadset.xml`)
Output	Recalled Subreads (`<movie>.recall_subreads.bam`)

recalladapters is necessary for self-complementary AAV (scAAV) constructs because only one side of the molecule has a SMRT Adapter. Without recalling adapter, multiplexed scAAV data may show poor demultiplexing results (as only one side of the barcode can be found). With proper recall adapter, the new CCS reads should double in read length, and mapping to the AAV construct should show a primary and supplementary alignment that is reverse-complement to each other and covering the whole ITR-to-ITR region.

Make sure the input XML file is straight from the sequencer. A newly generated subreadset might break recalladapters.

Use the following command to recall adapters from a CLR (subreads) BAM file:

recalladapters \
    -j <threads> \
    --minSnr=2.0 \
    --disableAdapterCorrection \
    --subreadset <movie>.subreadset.xml \
    -o <movie>.recalled

For the example scAAV dataset, see the run.sh file in step 1-recall adapters on the exact commands used to generate the recalled BAM files.

0.1 Fix the `recalladapters` output

To upload the recalled data to SMRTlink, the subreadset.xml file will need to be regenerated (recalladapters produces a broken subreadset). We'll use dataset create on the recalled subreads file to make an importable subreadset. dataset is available through pbbioconda.

dataset create \
    --type SubreadSet \
    --generateIndices \
    --name <movie>.recalled.fixed \
    <movie>.recalled.fixed.subreadset.xml
    <movie>.recalled.subreads.bam

The new <movie>.recalled.fixed.subreadset.xml should now be importable into SMRTlink for CCS.

0.2 Re-run CCS after Recall Adapter

ccs	Re-run CCS after recalladapters (recommend v6.2.0+)
Required for	scAAV
Input	Recalled Subreads (`<movie>.recall_subreads.bam`)
Output	Recalled CCS (`<movie>.hifi_reads.bam`)

After recalladapters, CCS must be re-run.

In the example dataset, we show how this is run via SMRTLink script that also does de-multiplexing at the same time.

You can use SMRTLink (or Cromwell workflows) to run CCS (with or without demultiplexing, depending on your sample), since CCS can be time-consuming without proper chunking that can be handled by SMRTLink/Cromwell. See the SMRTLink Parameter set used for this CCS With Demultiplexing Run.

For datasets with mixed populations of AAV molecule types (ss and sc), ccs should be run using the --by-strand option.

0B. (optional) Demultiplexing samples

If your sample is multiplexed, use SMRTLink Demultiplexing or CCS With Demultiplexing as a GUI option to demultiplex samples.

If using command line, use the lima tool.

An example is below:

lima -s --peek-guess --split-named \
        hifi_reads.bam \
        barcodes.fasta \
        demux.bam

All the remaining analyses are shown as performed for a single (demultiplexed) sample.

1. Preparing the genome and annotation file

The following fasta files (if available) should be combined into a single "genome" fasta file:

(required) vector (including the AAV vector + plasmid backbone as a single sequence)
helper plasmid
repcap plasmid
host genome (ex: hg38)

NOTE the sequence IDs should be free of blank spaces and symbols. Stick with numbers, alphabet letters, and _ and -. If necessary, rename the sequence IDs in the combined fasta file.

Create a annotation.txt file according to the following format:

NAME=<sequence id>;TYPE={vector|helper|repcap|host|lambda};REGION=<start>-<end>;

Only the vector annotation is required and must be marked with REGION= (the position from ITR to ITR) as well. All other types are optional.

For example:

NAME=myAAV_plasmid;TYPE=vector;REGION=100-2000;
NAME=myRepCap_plasmid;TYPE=repcap;REGION=500-1500;
NAME=myHelper_plasmid;TYPE=helper;
NAME=chr1;TYPE=host;
NAME=chr2;TYPE=host;
NAME=chr3;TYPE=host;

IMPORTANT!!! you must have exactly the same number of chromosomes in the reference fasta file as annotations file. This is especially common if you are including human genome (hg38) which has a lot of alternative chromosomes. It is recommended that you use a version of hg38 that only lists the major chromosomes.

2. Mapping reads to AAV Genome

minimap2	Map CCS to Genome
Required for	all AAV samples
Input	CCS (`<movie>.hifi_reads.fasta`)
Output	Mapped, sorted CCS SAM/BAM file (`.sorted_mapped.sam

If you have multiplexed samples, map this per demultiplexed CCS BAM file.

To map the HiFi/CCS sequences to the genome, run the following:

minimap2 --eqx -a --secondary=no \
   -t <threads> \
   genome.fasta <movie>.hifi_reads.fasta > mapped.sam

samtools sort -n -O SAM mapped.sam > mapped.sort_by_read.sam

3. Summarizing and Plotting Alignment Results

You can directly clone the AAV repo. You will need to make sure you have the Python and R dependencies. Read the AAV install section to see how to get the dependencies.

usage: summarize_AAV_alignment.py [-h] [--max_allowed_missing_flanking MAX_ALLOWED_MISSING_FLANKING] [--cpus CPUS]
                                  [--debug]
                                  sam_filename annotation_txt output_prefix

positional arguments:
  sam_filename          Sorted by read name SAM file
  annotation_txt        Annotation file
  output_prefix         Output prefix

optional arguments:
  -h, --help            show this help message and exit
  --max_allowed_missing_flanking MAX_ALLOWED_MISSING_FLANKING
                        Maximum allowed missing flanking bp to be still considered 'full'
                        (default: 100)
  --cpus CPUS           Number of CPUs (default: 1)
  --debug

For example:

python summarize_AAV_alignment.py \
         mapped.sort_by_read.sam annotation.txt \
         myAAV 
         --cpus 12

We can now plot the summary into a PDF:

Rscript plotAAVreport.R myAAV annotation.txt

4. AAV Classification

Refer to laava's wiki for the latest definitions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial: Analyzing AAV Data

0A. Recall Adapter and CCS re-run if using CLR (subread) data

0.1 Fix the `recalladapters` output

0.2 Re-run CCS after Recall Adapter

0B. (optional) Demultiplexing samples

1. Preparing the genome and annotation file

2. Mapping reads to AAV Genome

3. Summarizing and Plotting Alignment Results

4. AAV Classification

Clone this wiki locally

Tutorial: Analyzing AAV Data

0A. Recall Adapter and CCS re-run if using CLR (subread) data

0.1 Fix the recalladapters output

0.2 Re-run CCS after Recall Adapter

0B. (optional) Demultiplexing samples

1. Preparing the genome and annotation file

2. Mapping reads to AAV Genome

3. Summarizing and Plotting Alignment Results

4. AAV Classification

Clone this wiki locally

0.1 Fix the `recalladapters` output