### Impact of Internal Priming on IsoQuant

We classify each aligned read as properly primed or not, based on its genomic context and a set of heuristics. The classification is done by a command-line tool, `annotate_priming`, which takes a config file that we create below. The tool takes a few minutes per BAM file, can process several files in parallel, and writes new, annotated BAM files. After running it, we count the reads in the annotated files and produce some plots.

We developed the heuristics based on previous tools and literature, as well as inspection of single-cell and bulk data. Inspection of individual cases shows that these calls include both false positives and false negatives, but we believe the overall picture of priming is accurate. Developing better datasets and methods for detecting improper priming is an important future direction.

### Imports

In [None]:
import pickle
from collections import defaultdict, Counter
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import numpy as np

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

import pysam
import yaml

from mdl.sc_isoform_paper import today
from mdl.sc_isoform_paper.constants import MASSEQ_FILENAMES, MASSEQ_KEYS, SAMPLE_COLORS
from mdl.sc_isoform_paper.isoquant import IsoQuantClass
from mdl.sc_isoform_paper.priming import Priming, count_classes_and_isoquant, tx_count_breakdown
from mdl.sc_isoform_paper.plots import plot_isoform_area

### Setup

In [None]:
pysam.set_verbosity(0)

root_dir = Path.home()
sh_dir = root_dir / "sh_scripts"
reference_path = root_dir / "reference"

grch38_fasta = reference_path / "GRCh38" / "GRCh38.fasta"

gencode_gtf = reference_path / "GRCh38.gencode.v39.annotation.basic.gtf"
gencode_polya_gtf = reference_path / "GRCh38.gencode.v39.polyAs.gtf.gz"

polya_motif_file = reference_path / "mouse_and_human.polyA_motif.txt"

data_path = root_dir / "data" / "masseq"
minimap_path = data_path / "20240709_minimap"
isoquant_path = data_path / "20240722_isoquant"
annotated_path = data_path / "20250124_annotated"

priming_counts_file = data_path / "isoquant_priming_counts.pickle"

figure_path = root_dir / "202501_figures"

In [None]:
sample_order = [MASSEQ_KEYS[i] for i in (1, 3, 4)]
sample_order

In [None]:
# mapped bams
mapped_bams = sorted(minimap_path.glob("*tagged.mapped.sorted.primary.bam"))

# isoquant prefix for each bam
isoquant_paths = [isoquant_path / f"{MASSEQ_FILENAMES[int(mb.name.split('.')[2])]}" / "OUT" for mb in mapped_bams]

len(mapped_bams), len(isoquant_paths)

In [None]:
config = dict(
    reference_fasta=str(grch38_fasta),
    reference_gtf=str(gencode_gtf),
    polya_motif_file=str(polya_motif_file),
    polya_annotations=str(gencode_polya_gtf),
    priming_parameters=dict(
        feature_pre=5,
        feature_post=5,
        motif_pre=30,
        motif_post=20,
        pas_pre=5,
        pas_post=20,
        polya_window=20,
        polya_max_len=6,
        polya_max_n=12,
    ),
    output_tag=["simple", "full"],
    bam_paths=list(map(str, mapped_bams)),
    isoquant_paths=list(map(str, isoquant_paths))
)


In [None]:
with open(data_path / f"{today}_priming_config.yaml", "w") as out:
    yaml.dump(config, out, sort_keys=False)
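
Since `isoquant_paths` is built with one entry per mapped BAM, a quick sanity check on the config dictionary can catch a mismatch before the tool is run. The sketch below is only an illustration: `check_config` is a hypothetical helper, not part of the package, and it assumes that the two lists are paired by position and that the IsoQuant paths are optional.

```python
def check_config(config):
    """Return a list of problems found in a priming config dict (empty if none)."""
    problems = []
    bams = config.get("bam_paths", [])
    isoquant = config.get("isoquant_paths", [])

    if not bams:
        problems.append("no BAM paths given")

    # Assumption: IsoQuant paths are optional, but when present they are
    # matched to the BAM files by position, so the lengths must agree.
    if isoquant and len(isoquant) != len(bams):
        problems.append(f"{len(bams)} BAM paths but {len(isoquant)} IsoQuant paths")

    return problems
```

Calling `check_config(config)` just before writing the YAML file should return an empty list when the inputs line up.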

### Running the priming classifier

Classification is available as a command-line tool in this package, under the name `annotate_priming`:

```shell
annotate_priming --config-file path/to/priming_config.yaml -p 8 --filter-isoquant-by-bam
```

This can run in parallel over multiple BAM files at once. Each read will be annotated with a priming class depending on the context of the read alignment.

If IsoQuant output paths are provided, the resulting BAM files will have additional information from the read assignment and model construction steps. Note that this option will increase the memory usage of the script as the IsoQuant data must be loaded into memory.
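
To give a flavor of what "context of the read alignment" can mean, here is a minimal sketch of one heuristic commonly used to flag internal priming: checking whether the genomic sequence just downstream of a read's 3' end is A-rich. This is not the implementation used by `annotate_priming`. The function names are hypothetical, and although the default thresholds reuse the numeric values of `polya_window`, `polya_max_len`, and `polya_max_n` from the config above, treating them as a window size, a run length, and a total count is an assumption.

```python
def longest_a_run(seq):
    """Length of the longest run of consecutive 'A' bases in seq."""
    best = run = 0
    for base in seq.upper():
        run = run + 1 if base == "A" else 0
        best = max(best, run)
    return best


def looks_a_rich(downstream_seq, window=20, min_run=6, min_total=12):
    """Flag a genomic window as A-rich (a possible internal priming site).

    Hypothetical rule: the first `window` bases are flagged if they contain a
    run of at least `min_run` consecutive As, or at least `min_total` As overall.
    The sequence is assumed to be oriented along the read's strand already.
    """
    w = downstream_seq[:window].upper()
    return longest_a_run(w) >= min_run or w.count("A") >= min_total
```

A real classifier would combine a check like this with annotation features (known polyadenylation sites, motifs, transcript ends), which is what the remaining parameters in the config suggest.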

### Reading the annotated files

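Reading through the annotated BAMs takes only a few minutes, but it requires the annotated BAM files to be available. If the pickle of results already exists, the cell below loads it instead; otherwise it counts the reads and saves the pickle for next time.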

In [None]:
annotated_bams = sorted(annotated_path.glob("*annotated.bam"))
len(annotated_bams)

In [None]:
%%time

if priming_counts_file.exists():
    # reuse the cached counts rather than re-reading the annotated BAMs
    with priming_counts_file.open("rb") as fh:
        tx_priming_counts = pickle.load(fh)
else:
    tx_priming_counts = defaultdict(Counter)

    # count each annotated BAM in a separate process, merging counts per sample key
    with ProcessPoolExecutor(8) as exc:
        for k, txc in exc.map(
            count_classes_and_isoquant,
            (MASSEQ_KEYS[int(fn.name.split(".")[2])] for fn in annotated_bams),
            annotated_bams,
        ):
            tx_priming_counts[k] += txc

    # cache the result so later runs can skip the BAM files
    with priming_counts_file.open("wb") as out:
        pickle.dump(tx_priming_counts, out)
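
The load-or-compute pattern used in this cell can be factored into a small helper. The sketch below is a generic illustration using only the standard library; `load_or_compute` is a hypothetical name and is not part of the package.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_file, compute):
    """Return the pickled result at cache_file, computing and saving it if absent."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh)

    # cache miss: run the (possibly slow) computation and store the result
    result = compute()
    with cache_file.open("wb") as out:
        pickle.dump(result, out)
    return result
```

Here `compute` is any zero-argument callable, so the expensive BAM-reading step only runs when the cache file is missing. As with any pickle, the cache should only be loaded from a trusted location.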

In [None]:
# supplementary table 7
def print_priming_calls(tx_priming_counts, include_mito=True):
    keys = sorted(tx_priming_counts)

    pcc = defaultdict(Counter)
    for k in tx_priming_counts:
        for (_,_,p), v in tx_priming_counts[k].most_common():
            pcc[k][p] += v

    if include_mito:
        tots = {k: pcc[k].total() for k in keys}
    else:
        tots = {k: pcc[k].total() - pcc[k][Priming.MITO] for k in keys}

    # order the classes by their summed rate across samples, most common first
    p_set = sorted(Priming, key=lambda p: sum(pcc[k][p] / tots[k] for k in keys), reverse=True)
    if not include_mito:
        p_set.remove(Priming.MITO)

    print(f"{'':16s}", *(f"{' '.join(k):>12}" for k in keys), sep="\t")
    for p in p_set:
        print(f"{p.name:16s}", end="\t")
        print(*(f"{pcc[k][p] / tots[k]:12.2%}" for k in keys), sep="\t")

    print(f"\n{'Total':16s}", *(f"{tots[k]:12,}" for k in keys), sep="\t")

In [None]:
print_priming_calls(tx_priming_counts)

### Plotting

In [None]:
tx_class, tx_count, tx_ratio = tx_count_breakdown(tx_priming_counts)

In [None]:
plot_isoform_area(
    sample_order, tx_class, tx_count, tx_ratio,
    output_path=figure_path / "fig2c_priming_area.svg"
)

In [None]:
for txt in IsoQuantClass:
    for k in sample_order:
        tx_list = sorted(
            (tx for tx in tx_count[k] if tx_count[k][tx] >= 5 and tx_class[k][tx] == txt),
            key=tx_ratio[k].get, reverse=True
        )
        
        good_x = np.array([tx_ratio[k][tx] for tx in tx_list])

        print(k[0], str(txt), f"{(good_x == 0).mean():.1%}", sep="\t")
    print()

In [None]:

for k in sample_order:
    # group the per-transcript read counts by IsoQuant class
    counts_by_type = defaultdict(list)
    for tx in tx_count[k]:
        counts_by_type[tx_class[k][tx]].append(tx_count[k][tx])
    print(k[0])
    for txt in IsoQuantClass:
        pcts = np.percentile(counts_by_type[txt], (5, 20, 50, 80, 95)).astype(int)
        print(str(txt), *(f"{v:6,}" for v in pcts), sep="\t")
    print()


### Condensed priming plot

We created many priming categories based on the different combinations of features that we observed. For figure 2a we condense the results into a few broad categories.

The density of the mitochondrial genome made categorization more difficult, as reads were more likely to overlap multiple features in close proximity. We therefore placed mitochondrial reads in their own category and excluded that category from the total when computing the rate of each priming event.
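
The effect of leaving one category out of the denominator can be seen with a toy example. The counts below are made up for illustration, and `rates_excluding` is a hypothetical helper rather than a function from the package.

```python
from collections import Counter


def rates_excluding(counts, excluded):
    """Fraction of reads per category, with `excluded` left out of the denominator."""
    total = sum(v for k, v in counts.items() if k != excluded)
    return {k: v / total for k, v in counts.items() if k != excluded}


# made-up read counts, for illustration only
toy = Counter({"GOOD": 60, "TX_GPA": 20, "MITO": 20})
print(rates_excluding(toy, "MITO"))  # {'GOOD': 0.75, 'TX_GPA': 0.25}
```

With the mitochondrial reads included in the total, the same two categories would instead come out as 60% and 20%.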

In [None]:
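# collapse the per-transcript counts to one count per priming class for each sample
priming_class_counts = defaultdict(Counter)
for s in tx_priming_counts:
    for (_, _, p), v in tx_priming_counts[s].items():
        priming_class_counts[s][p] += v

# rate of each priming class, with mitochondrial reads excluded from the total
priming_rates = defaultdict(dict)
for s in priming_class_counts:
    tot = priming_class_counts[s].total() - priming_class_counts[s][Priming.MITO]
    for p in Priming:
        priming_rates[s][p] = priming_class_counts[s][p] / tot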

priming_cats = {
    frozenset({Priming.GOOD, Priming.ANNO_GPA, Priming.TX_PAS}): "Known\ntranscription sites",
    frozenset({Priming.TX_MOTIF}): "Unannotated\ntranscription sites",
    frozenset({Priming.TX_GPA_PAS, Priming.TX_GPA, Priming.TX_GPA_MOTIF, Priming.CDS_GPA, Priming.NC_GPA, Priming.INTERGENIC_GPA, Priming.AS_TX_GPA, Priming.AS_TX_GPA_NC}): "Suspected\ninternal priming",
    frozenset({Priming.AS_TX, Priming.AS_TX_NC}): "Antisense\nto known transcript",
}
priming_cats[frozenset(set(Priming).difference(*priming_cats)) - {Priming.MITO}] = "Other"
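
When grouping enum members by hand like this, it is easy to list a member twice or leave one out. A small check that the groups are pairwise disjoint and cover everything except the excluded members can guard against that. The sketch below is an assumption-laden illustration: `check_partition` is a hypothetical helper, and `Toy` is a stand-in for the real `Priming` enum.

```python
from enum import Enum, auto


def check_partition(groups, members, excluded=frozenset()):
    """Check that `groups` are pairwise disjoint and cover `members` minus `excluded`."""
    seen = set()
    for g in groups:
        overlap = seen & g
        if overlap:
            names = sorted(m.name for m in overlap)
            raise ValueError(f"members in more than one group: {names}")
        seen |= g

    missing = set(members) - set(excluded) - seen
    if missing:
        names = sorted(m.name for m in missing)
        raise ValueError(f"members not in any group: {names}")
    return True


class Toy(Enum):  # hypothetical stand-in for the Priming enum
    A = auto()
    B = auto()
    C = auto()
    D = auto()


check_partition([frozenset({Toy.A}), frozenset({Toy.B, Toy.C})], Toy, {Toy.D})
```

In this notebook the equivalent call would be `check_partition(priming_cats, Priming, {Priming.MITO})`, since iterating over the dictionary yields its frozenset keys.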

In [None]:

fig, ax = plt.subplots(1, 1, figsize=(12, 6), gridspec_kw={"hspace": 0.3})

x = np.arange(len(priming_cats))
w = 0.3

for i, s in enumerate(sample_order):
    ax.bar(
        x + i * w + 0.05,
        [sum(priming_rates[s][p] for p in ps) for ps in priming_cats],
        width=w, 
        color=SAMPLE_COLORS[s[0]], align="edge", label=s[0]
    )

ax.set_xticks(x + 0.5, priming_cats.values())
ax.legend()
ax.set_xlim(-0.5, len(priming_cats) + 0.5)

ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))

plt.savefig(figure_path / "fig2a_priming_rates.svg")
plt.show()