# Mapping and filtering BAMs for IsoQuant

This notebook writes a shell script that runs minimap2 on our tagged BAMs. We then filter the aligned BAMs for high-scoring alignments and, finally, write the commands to run IsoQuant on the filtered BAMs.

### imports

In [None]:
from collections import defaultdict
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

from mdl.sc_isoform_paper import today
from mdl.sc_isoform_paper.constants import MASSEQ_FILENAMES
from mdl.sc_isoform_paper.util import filter_reads

import pysam

### setup

In [None]:
pysam.set_verbosity(0)

root_dir = Path.home()
sh_dir = root_dir / "sh_scripts"
reference_path = root_dir / "reference"

grch38_fasta = reference_path / "GRCh38" / "GRCh38.fasta"

gencode_basic_gtf = reference_path / "GRCh38.gencode.v39.annotation.basic.gtf"
gencode_bed = reference_path /  "GRCh38.gencode.v39.annotation.bed"

data_path = root_dir / "data" / "masseq"
cdna_path = data_path / "20240708_cdna"

In [None]:
tagged_bams = sorted(cdna_path.glob("*tagged.bam"))

mapped_path = data_path / f"{today}_minimap/"
mapped_path.mkdir(exist_ok=True)

### minimap2

With some extra options, we can include the barcode and UMI tags in the minimap2 output, so that we end up with a single-cell alignment file.
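
As a rough illustration of how the tags travel: `samtools fastq -T CB,UB` appends the tags to each FASTQ header line in SAM format (tab-separated `TAG:TYPE:VALUE` fields), and `minimap2 -y` copies that header comment into the SAM record. The sketch below parses such a header. The read name and tag values are made up, and `parse_fastq_tags` is a hypothetical helper for illustration, not part of our pipeline.

```python
def parse_fastq_tags(header):
    """Split a FASTQ header line into the read name and a dict of SAM-style tags."""
    # the read name comes first, followed by tab-separated TAG:TYPE:VALUE fields
    name, *fields = header.lstrip("@").rstrip("\n").split("\t")
    tags = {}
    for field in fields:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return name, tags


# hypothetical header, shaped like the output of `samtools fastq -T CB,UB`
example_header = "@read/1/ccs\tCB:Z:ACGTACGTACGTACGT\tUB:Z:TTGCAATCGA"
read_name, read_tags = parse_fastq_tags(example_header)
```

Because the comment is already in SAM tag format, the aligner can append it to each alignment record unchanged, which is what makes the output a single-cell alignment file.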

`minimap2` optionally takes a BED file for junction information. We can generate that with their `paftools.js` script:

```
paftools.js gff2bed GRCh38.gencode.v39.annotation.gtf > GRCh38.gencode.v39.annotation.bed
```

Note that we provide the full Gencode v39 annotation to `minimap2` here. For the other analyses we use the "basic" annotation, because it contains a more reliable set of annotations.

In [None]:
with open(sh_dir / f"{today}_minimap_cmds.sh", "w") as out:
    for in_file in tagged_bams:
        fq_file = mapped_path / in_file.with_suffix(".fastq").name
        out_sam = fq_file.with_suffix(".mapped.sam")
        out_bam = fq_file.with_suffix(".mapped.sorted.bam")

        # skip samples that have already been mapped and sorted
        if out_bam.exists():
            continue
        # -T copies the CB and UB tags into the FASTQ header comments
        print(f"samtools fastq -TCB,UB {in_file} > {fq_file}", file=out)
        # -y copies those FASTQ comments (our tags) into the SAM output
        print(
            "minimap2 -t 16 -ayx splice:hq -uf -G1250k -Y --MD",
            f"--junc-bed {gencode_bed}", grch38_fasta,
            f"{fq_file} > {out_sam}",
            file=out
        )
        print(f"samtools sort {out_sam} > {out_bam}", file=out)
        # remove the intermediate files once the sorted BAM is written
        print(f"rm {out_sam} {fq_file}", file=out)


### Alignment filtering

We filter the aligned BAMs for high-scoring primary alignments.

High-scoring is defined as: `MAPQ == 60` or `[alignment score] / [query len] > 0.9`

We are also filtering out secondary and supplementary alignments here. In our tests, including high-scoring secondary alignments did not meaningfully improve the isoform identification in this dataset, and their presence complicates the internal priming analysis.
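
The filter described above can be sketched as a simple predicate. This is a minimal illustration under stated assumptions, not the actual implementation of `filter_reads`: the `Alignment` class is a hypothetical stand-in for the fields one would read from a `pysam.AlignedSegment`, we assume the alignment score is the `AS` tag, and the real function may handle edge cases (such as unmapped reads) differently.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical stand-in for the relevant fields of an aligned read."""
    mapq: int
    alignment_score: int  # assumed to be the AS tag
    query_length: int
    is_secondary: bool = False
    is_supplementary: bool = False


def is_high_scoring_primary(a, min_ratio=0.9):
    """Keep primary alignments with MAPQ == 60 or score / length > min_ratio."""
    if a.is_secondary or a.is_supplementary:
        return False
    return a.mapq == 60 or a.alignment_score / a.query_length > min_ratio


# made-up alignments covering each branch of the rule
examples = [
    Alignment(mapq=60, alignment_score=100, query_length=1000),  # kept: MAPQ 60
    Alignment(mapq=20, alignment_score=950, query_length=1000),  # kept: ratio 0.95
    Alignment(mapq=20, alignment_score=900, query_length=1000),  # dropped: ratio 0.9
    Alignment(mapq=60, alignment_score=1000, query_length=1000, is_secondary=True),
]
kept = [is_high_scoring_primary(a) for a in examples]
```

Note that the ratio test is strict, so an alignment sitting exactly at 0.9 is dropped unless its MAPQ is 60.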


In [None]:
def filter_bam(mapped_bam):
    if (new_bam := mapped_bam.with_suffix(".primary.bam")).exists():
        print(f"{new_bam} already exists")
        return

    n_written = 0  # stays 0 if no reads pass the filter
    with (
        pysam.AlignmentFile(mapped_bam, "rb", threads=8) as fh,
        pysam.AlignmentFile(new_bam, "wb", template=fh, threads=8) as out,
    ):
        for n_written, a in enumerate(filter_reads(fh), 1):
            out.write(a)

    return new_bam, n_written  # the number of reads written to output

In [None]:
mapped_bams = sorted(mapped_path.glob("*.mapped.sorted.bam"))

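with ProcessPoolExecutor(8) as exc:
    # filter_bam returns None for BAMs that were already filtered; drop those
    filtered_counts = dict(filter(None, exc.map(filter_bam, mapped_bams)))

# group the filtered BAMs by the integer key in the third dot-separated field of the file name
key_to_samples = defaultdict(list)
for in_file in sorted(filtered_counts):
    key_to_samples[int(in_file.name.split(".")[2])].append(in_file)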


### IsoQuant commands

Now that we've filtered out low-quality alignments, we run IsoQuant. We make sure to include the barcode and UMI values in the output for later analysis.

In [None]:
output_path = data_path / f"{today}_isoquant"
output_path.mkdir(exist_ok=True)

In [None]:
with open(sh_dir / f"{today}_isoquant_cmds.sh", "w") as out:
    for i, input_bams in key_to_samples.items():
        s = MASSEQ_FILENAMES[i]

        out_dir = output_path / s
        out_dir.mkdir(exist_ok=True)

        # the BAM files need to be indexed before running IsoQuant
        print("samtools index --threads 12 -M", *input_bams, file=out)
        print(
            "isoquant.py",
            "--data_type pacbio_ccs",
            "--count_exons",
            "--threads 16",
            "--force",
            f"--reference {grch38_fasta}",
            f"--genedb {gencode_basic_gtf}",
            "--complete_genedb",
            "--stranded forward",
            "--polya_requirement never",
            "--bam_tags CB,UB",
            "--bam", *input_bams,
            f"-o {out_dir}",
            file=out
        )
