# Fixation experiments

This notebook covers the long-read analysis of the two fixation experiments. The preliminary experiment contained a panel of several different fixation protocols along with fresh PBMCs, hashtagged into a single PIPseq library. From this experiment we selected Methanol+DSP as our preferred fixation method. To validate our choice with a larger number of cells, we ran a second experiment in which fresh PBMCs were hashtagged together with cells fixed with Methanol+DSP.

For both experiments, we have matched short-read sequencing to identify the hashtags and cell barcodes. We also use the short-read data to cluster and annotate cells using canonical PBMC markers, then translate that annotation to the long-read data.

### Raw data

 * Short-read data in `gs://mdl-sc-isoform-2025-ms/fix-fresh_illumina` (GEX = gene expression, HTO = hashtag file)
   * Fixation panel: `MDL_FixationPanel_*`
   * Methanol+DSP vs fresh PBMCs: `MDL_FixFresh_*`
 * Long-read data in `gs://mdl-sc-isoform-2025-ms/fix-fresh_masseq`
   * Fixation panel: `m84175_240224_082414_s4.hifi_reads.bcM0004.bam`
   * Methanol+DSP vs fresh PBMCs: `m84250_240530_044215_s1.hifi_reads.bcM0001.bam` and `m84250_240530_064125_s2.hifi_reads.bcM0001.bam`

To process short-read data, we demultiplexed with `bcl2fastq2` and ran PIPseeker:

For the panel of fixation protocols:
```
./pipseeker-v3.3.0-linux/pipseeker full --output-path MDL_FixationPanel \
    --fastq MDL_FixationPanel_GEX \
    --hto-fastq MDL_FixationPanel_HTO \
    --hto-tags metadata/hto_tags_1-6.csv \
    --star-index-path reference/GRCh38 \
    --chemistry v4
```

For the second comparison between Methanol+DSP and fresh cells:
```
./pipseeker-v3.3.0-linux/pipseeker full --output-path MDL_FixFresh \
    --fastq MDL_FixFresh_GEX \
    --hto-fastq MDL_FixFresh_HTO \
    --hto-tags metadata/hto_tags_7-8.csv \
    --star-index-path reference/GRCh38 \
    --chemistry v4
```

To process the long-read data, we follow the same recipe as the previous notebooks:

* Deconcatenate reads with `skera`
* Classify `proper` reads with `marti`
* Extract barcodes and UMIs with `bouncer`

After these steps, we use the clustering from short-read data to identify CD14+ monocytes and extract UMI counts and read length distributions for those cells, as well as for other cells.
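
The per-cell summaries described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the notebook's own code: the `records` tuples stand in for reads from a tagged BAM, and `summarize_reads` is a hypothetical helper introduced only for this example.

```python
from collections import defaultdict


def summarize_reads(records, keep_barcodes):
    """Collect unique (sequence, UMI) pairs per barcode, restricted to keep_barcodes.

    Hypothetical helper: `records` is an iterable of (barcode, umi, sequence).
    """
    pairs = defaultdict(set)
    for barcode, umi, seq in records:
        if barcode in keep_barcodes:
            pairs[barcode].add((seq, umi))

    # number of unique (sequence, UMI) pairs per cell
    umi_counts = {bc: len(p) for bc, p in pairs.items()}
    # read lengths of those unique pairs, sorted for readability
    read_lens = {bc: sorted(len(seq) for seq, _ in p) for bc, p in pairs.items()}
    return umi_counts, read_lens


records = [
    ("BC1", "UMI_A", "ACGTACGT"),
    ("BC1", "UMI_A", "ACGTACGT"),  # exact duplicate, collapsed by the set
    ("BC1", "UMI_B", "ACGT"),
    ("BC2", "UMI_C", "ACGTAC"),
    ("BC3", "UMI_D", "AC"),  # barcode not in the annotated set, skipped
]
umi_counts, read_lens = summarize_reads(records, {"BC1", "BC2"})
```

Restricting to an annotated barcode set up front keeps memory use proportional to the cells of interest rather than to every barcode in the BAM.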

### imports

In [None]:
import csv
import gzip
import itertools
import pickle
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import sparse
import yaml

import pysam

import mdl.sc_isoform_paper.util as util
from mdl.sc_isoform_paper import today
from mdl.sc_isoform_paper.marti import CONFIG_DICT, SAMPLE_CONFIG
from mdl.sc_isoform_paper.pipseq_barcodes import barcode_to_sequence, sequence_to_int, barcode_to_int
from mdl.sc_isoform_paper.plots import plot_dists

# a bunch of functions for single-cell analysis
# installed from www.github.com/methodsdev/isoscelles
from mdl.isoscelles.io import read_mtx
from mdl.isoscelles.leiden import recursive_cluster, cluster_leaf_nodes, cluster_labels


In [None]:
pysam.set_verbosity(0)

root_dir = Path.home()
data_path = root_dir / "data" / "20240304_fixpip"
figure_path = root_dir / "202501_figures"

## skera, marti, barcode extraction

In [None]:
raw_bam = data_path / "raw" / "m84175_240224_082414_s4.hifi_reads.bcM0004.bam"

base = raw_bam.name.split(".")[0]

skera_path = data_path / f"{today}_skera"
skera_path.mkdir(exist_ok=True)

skera_bam = skera_path / f"{base}.skera.bam"

# path to the marti binary
marti_bin = root_dir / "marti/build/bin/marti"

# pipseq barcode file
pipseq_barcode_file = root_dir / "metadata" / "fluent_barcodes.txt.gz"

# path to extract_barcodes
extract_barcodes_bin = Path("/opt/conda/envs/fa/bin/extract_barcodes")

In [None]:
# remove the `echo` to run skera
! echo skera split -j 16 {raw_bam} metadata/mas16_primers.fasta {skera_bam}

#### marti

We run `marti` as before, with settings for PIPseq.

In [None]:
marti_path = data_path / f"{today}_marti"
marti_path.mkdir(exist_ok=True)

# path to the BAM that will be created
marti_bam = marti_path / skera_bam.with_suffix(".classified.bam").name

In [None]:
config_file = marti_path / "config.yaml"

# write config file with appropriate parameters
with open(config_file, "w") as out:
    print(
        yaml.dump(
            {"input_bam": str(skera_bam)}
            | SAMPLE_CONFIG['PIPseq']
            | CONFIG_DICT,
            sort_keys=False
        ),
        file=out
    )

In [None]:
# remove the `echo` to run marti
! echo {marti_bin} {config_file}

#### Barcode extraction

Similarly, we run `extract_barcodes` with the settings for a PIPseq experiment.

In [None]:
cdna_path = data_path / f"{today}_cdna"
cdna_path.mkdir(exist_ok=True)

# path to the BAM that will be created
tagged_bam = cdna_path / marti_bam.with_suffix(".tagged.bam").name

In [None]:
barcode_config_file = data_path / f"{today}_barcode_config.yaml"

config = dict(
    sample_type='PIPseq',
    barcode_file=str(pipseq_barcode_file),
    umi_size=12,
    buffer_size=56,
    bam_paths=[str(marti_bam)],
)

with open(barcode_config_file, "w") as out:
    yaml.dump(config, out)

In [None]:
# remove the `echo` to run extract_barcodes
! echo {extract_barcodes_bin} --config-file {barcode_config_file} --output-dir {cdna_path}

## Match barcodes from short-read analysis

After running PIPseeker, the fixation panel was annotated in R (see the script in the paper repository for details). We need that annotation file here to match barcodes to the long-read data.
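
The grouping step can be sketched as follows. This is an illustration only: the toy CSV and its column names (`barcode`, `experiment`, `cell_type`) are hypothetical and do not reflect the layout of the real annotation file, which the notebook reads by column position.

```python
import csv
import io
from collections import defaultdict

# toy stand-in for the annotated metadata; column names are hypothetical
toy_csv = io.StringIO(
    "barcode,experiment,cell_type\n"
    "AAAA,Fresh,MonoCD14_1\n"
    "CCCC,Fresh,TCD4_2\n"
    "GGGG,Meth-DSP,MonoCD14_3\n"
)

# group barcodes by (experiment, cell-type label)
bcs_by_group = defaultdict(set)
for row in csv.DictReader(toy_csv):
    # drop the suffix after the first underscore to get the coarse label
    label = row["cell_type"].split("_")[0]
    bcs_by_group[row["experiment"], label].add(row["barcode"])
```

Keying the dictionary on the `(experiment, label)` tuple makes it easy to pull out one cell type across all fixation conditions.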

In [None]:
# download the annotation
! echo gcloud storage cp gs://mdl-sc-isoform-data/... {data_path}/

In [None]:
with open(data_path / "PIPseeker-fixpip_run01_Metadata_annotated.csv") as fh:
    md_rows = list(csv.reader(fh))

bc_to_exp = {barcode_to_sequence(r[0]): r[11] for r in md_rows[1:]}

exps = sorted(set(bc_to_exp.values()))

bc_to_label = {barcode_to_sequence(r[0]): r[14].split("_")[0] for r in md_rows[1:]}

bcs_by_label = defaultdict(set)
for bc, exp in bc_to_exp.items():
    bcs_by_label[exp, bc_to_label[bc]].add(bc)

In [None]:
nice_exp = {
    'Fresh': "Fresh", 
    "Meth": 'Methanol',
    "Meth-DSP": 'Methanol\nDSP',
    'Meth-DSP-DEPC': "Methanol\nDSP & DEPC",
    'PFA-01p5min': "0.1% PFA\n5 min",
    'PFA-1p-5min': "1% PFA\n5 min"   
}

In [None]:
sr_barcodes = [barcode_to_sequence(r[0]) for r in md_rows[1:]]

assert len(sr_barcodes) == len(set(sr_barcodes))
sr_barcode_set = set(sr_barcodes)
sr_numis = np.array([int(r[2]) for r in md_rows[1:]])

In [None]:
umis_per_bc = defaultdict(set)
with pysam.AlignmentFile(tagged_bam, "rb", check_sq=False, threads=8) as fh:
    for a in fh:
        if (bc := a.get_tag("CB")) in sr_barcode_set:
            umis_per_bc[bc].add((a.query, a.get_tag("UB")))


In [None]:
bc_to_read_len = defaultdict(list)
for bc in umis_per_bc:
    bc_to_read_len[bc] = [len(q) for q, _ in umis_per_bc[bc] if q is not None]


In [None]:
lr_bc_set = set(umis_per_bc)
lr_bc_to_umi = {bc: len(umis_per_bc[bc]) for bc in umis_per_bc}

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 5), gridspec_kw={"hspace": 0.5})

label = "MonoCD14"

umi_dists = [
    [lr_bc_to_umi[bc] for bc in bcs_by_label[exp, label] & lr_bc_set]
    for exp in exps
]

plot_dists(
    ax, 
    umi_dists, 
    log=True,
    labels=[f"{nice_exp[exp]}" for exp in exps],
    title="UMIs/cell for CD14+ Monocytes"
)

ax.set_yticks([0, 1, 2, 3], minor=False)
ax.set_yticks(
    np.log10([v*10**i for i in range(4) for v in range(2, 10)][:-2]),
    minor=True
)
ax.set_ylabel("#UMIs")

plt.savefig(figure_path / "supp_fig8_fixation_umis.svg")
plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16, 5), gridspec_kw={"hspace": 0.5})
label = "MonoCD14"

rl_dists = [
    [v for bc in bcs_by_label[exp, label] for v in bc_to_read_len[bc]]
    for exp in exps
]

p = plot_dists(
    ax,
    rl_dists,
    labels=[f"{nice_exp[exp]}" for exp in exps],
    title="Read lengths for CD14+ Monocytes"
)

ax.set_ylim(bottom=-20, top=1220)
ax.set_ylabel("Read Length (bp)")

plt.savefig(figure_path / "supp_fig9_fixation_readlen.svg")
plt.show()

# Second fixation experiment

After identifying Methanol + DSP as the most promising fixation method, we did a second experiment to confirm the results with a larger sample. This time we only compared two conditions: fresh PBMCs vs Methanol+DSP fixed. The two samples were hashtagged and captured together in a single PIPseq experiment.

**Note:** This dataset was analyzed with a later version of PIPseeker, and the mapping from PIPseeker barcode (encoded as 16bp) to real sequence (39bp) changed since the previous version. This means we must slightly adjust how we match barcodes between short- and long-read data.
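
The cells below convert both representations to integers with `barcode_to_int` and `sequence_to_int` and match on those. As a rough illustration of how a DNA sequence can serve as an integer key, here is a minimal 2-bit packing sketch. `pack_sequence` and `unpack_sequence` are hypothetical names, and the actual functions in `mdl.sc_isoform_paper.pipseq_barcodes` may use a different scheme.

```python
# hypothetical 2-bit encoding; not necessarily what the paper's package does
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_sequence(seq):
    """Pack a DNA string into an int, two bits per base, first base most significant."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack_sequence(value, length):
    """Invert pack_sequence; the length is needed because leading 'A's pack to zero bits."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


key = pack_sequence("ACGT")
```

Integer keys are compact and hash quickly, which helps when testing membership for every read in a large BAM.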

In [None]:
data_dir = root_dir / "data" / "pipseq_fixfresh"

shortread_dir = data_dir / "illumina"
masseq_dir = data_dir / "masseq"
cdna_path = masseq_dir / "cdna"

## Short-read analysis

Here we are using the "sensitivity 5" filtering from PIPseeker.

In [None]:
markers = [
    "CD8A", "CD3D", "CD4", "CD14", "CCR7", "SELL",
    "FCGR3A", "CLEC10A", "CD1C", "NKG7", "NCAM1", "GNLY",
    "MS4A1", "IGKC", "IGLC2", "IRF7", "IGHM", "JCHAIN",
]


In [None]:
with gzip.open(shortread_dir / "HTO" / "raw_matrix" / "features.tsv.gz", "rt") as fh:
    sr_features = list(csv.reader(fh, delimiter="\t"))[:-2]

gene_dict_i = {g[1]: i for i,g in enumerate(sr_features)}
mt_ix = [gene_dict_i[g[1]] for g in sr_features if g[1].startswith("MT-")]

len(sr_features), len(mt_ix)

In [None]:
with gzip.open(shortread_dir / "HTO" / "demux" / "HTO7" / "filtered_matrix" / "sensitivity_5" / "barcodes.tsv.gz", "rt") as fh:
    hto7_barcodes_full = [barcode_to_int(line.strip()) for line in fh]

with gzip.open(shortread_dir / "HTO" / "demux" / "HTO8" / "filtered_matrix" / "sensitivity_5" / "barcodes.tsv.gz", "rt") as fh:
    hto8_barcodes_full = [barcode_to_int(line.strip()) for line in fh]

len(hto7_barcodes_full), len(hto8_barcodes_full)

In [None]:
m_hto7 = read_mtx(shortread_dir / "HTO" / "demux" / "HTO7" / "filtered_matrix" / "sensitivity_5" / "matrix.mtx.gz")
m_hto8 = read_mtx(shortread_dir / "HTO" / "demux" / "HTO8" / "filtered_matrix" / "sensitivity_5" / "matrix.mtx.gz")

mt_hto7 = util.calc_mt_pct(m_hto7, mt_ix)
mt_hto8 = util.calc_mt_pct(m_hto8, mt_ix)

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(16, 5))

ax[0].plot(sorted(m_hto7.sum(1).todense(), reverse=True), label="fresh")
ax[0].plot(sorted(m_hto8.sum(1).todense(), reverse=True), label="fixed")
ax[0].set_xscale("log")
ax[0].set_yscale("log")
ax[0].set_title("Fixed vs Fresh kneeplot")
ax[0].legend()

ax[1].hexbin(mt_hto7, m_hto7.sum(1).todense(), bins="log", extent=[0, 0.2, 0, 40000])
ax[1].set_title("Fresh, nUMIs vs mito %")
ax[2].hexbin(mt_hto8, m_hto8.sum(1).todense(), bins="log", extent=[0, 0.2, 0, 40000])
ax[2].set_title("Fixed, nUMIs vs mito %")
plt.show()

We can see a considerable number of low-mito, high-UMI barcodes in the fixed data. These barcodes are red blood cells (confirmed by checking markers such as hemoglobin genes). We filter them out by requiring a mitochondrial fraction above 2%; the same filter also drops barcodes with a mitochondrial fraction of 10% or more.
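
A toy version of this filter, assuming a small dense cells-by-genes array (the notebook itself works on sparse matrices and uses `util.calc_mt_pct`, whose implementation is not shown here). The thresholds match the ones used in the next cell; the matrix and `mt_columns` are made up for illustration.

```python
import numpy as np

# toy cells x genes count matrix; columns 0 and 1 play the role of MT- genes
counts = np.array([
    [1, 0, 99, 0],     # 1% mito: removed as a likely red blood cell
    [3, 2, 45, 50],    # 5% mito: kept
    [10, 10, 40, 40],  # 20% mito: removed by the upper bound
])
mt_columns = [0, 1]

# fraction of each cell's counts that come from mitochondrial genes
mt_frac = counts[:, mt_columns].sum(axis=1) / counts.sum(axis=1)

# keep cells strictly between the two thresholds
keep = (mt_frac > 0.02) & (mt_frac < 0.1)
filtered = counts[keep]
```

The same boolean mask can be applied to the barcode list so that rows and barcodes stay aligned after filtering.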

In [None]:
hto7_barcodes = [bc for bc, m in zip(hto7_barcodes_full, mt_hto7) if 0.02 < m < 0.1]
hto8_barcodes = [bc for bc, m in zip(hto8_barcodes_full, mt_hto8) if 0.02 < m < 0.1]

m_hto7_filtered = m_hto7[(0.02 < mt_hto7) & (mt_hto7 < 0.1)]
m_hto8_filtered = m_hto8[(0.02 < mt_hto8) & (mt_hto8 < 0.1)]

m_filtered = sparse.concat([m_hto7_filtered, m_hto8_filtered])
sr_numis = m_filtered.sum(1).todense()

hto_fresh = np.repeat([True, False], [len(hto7_barcodes), len(hto8_barcodes)])

### Clustering

In [None]:
clustering_file = shortread_dir / "shortread_clustering.pickle"
if clustering_file.exists():
    with clustering_file.open("rb") as fh:
        sr_clustering = pickle.load(fh)
else:
    res_list = [float(f"{b}e{p}") for p in range(-7, -2) for b in range(1, 10)]
    sr_clustering, _ = recursive_cluster(m_filtered, res_list, feature_cutoff_pct=0.05)

    with clustering_file.open("wb") as out:
        pickle.dump(sr_clustering, out)

_leaf_keys = cluster_leaf_nodes(sr_clustering, n=80)
_label_array = cluster_labels(sr_clustering, _leaf_keys)
_k2i = {key: i for i, key in enumerate(sorted(_leaf_keys))}
c_array = np.array([_k2i.get(key, -1) for key in _label_array])

We'll label the clusters by checking some canonical markers, as before.

In [None]:
# fraction of cells with a nonzero count, for each gene x cluster
pseudobulk_array = np.vstack(
    [np.sign(m_filtered[c_array == i, :]).mean(0).todense() for i in np.unique(c_array) if i != -1]
)

print("gene", *(f"{v:6}" for v in range(pseudobulk_array.shape[0])), sep="\t")
for g in markers:
    print(
        g,
        "\t".join(
            f"{pseudobulk_array[i, gene_dict_i[g]]:6.1%}" 
            for i in range(pseudobulk_array.shape[0])
        ), 
        sep="\t"
    )

Notably, some monocyte markers (e.g. _LYZ_, _FCGR3A_) appear in all clusters in this pseudobulk output. This reflects a high amount of ambient mRNA coming from those cells as they lysed during sample preparation. Still, we can identify cluster 4 as the CD14+ monocyte cluster because its _CD14_ level is much higher than in the other clusters.
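
The table printed above reports, for each cluster, the fraction of cells with at least one count for each marker. A minimal sketch of that calculation on a toy dense array (the data and cluster assignments are made up):

```python
import numpy as np

# toy cells x genes counts and a cluster label per cell
counts = np.array([
    [5, 0, 1],
    [2, 0, 0],
    [0, 3, 0],
    [0, 1, 0],
])
clusters = np.array([0, 0, 1, 1])

# np.sign turns counts into 0/1 detection flags; the mean over cells
# is then the fraction of cells in the cluster that express each gene
frac_expressing = np.vstack([
    np.sign(counts[clusters == c]).mean(axis=0)
    for c in np.unique(clusters)
])
```

Because any nonzero count registers as detected, a gene with widespread low-level ambient signal can show up in every cluster, so it is the relative difference between clusters that matters.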

## Long-read analysis

The first steps for long-read analysis are the same as before: we run `skera` to deconcatenate the Kinnex array, and then `marti` to annotate the reads with adapters. After that, we must extract barcodes and UMIs to match the reads up with our short-read annotation. This PIPseq dataset was relatively large and barcode extraction takes a long time, so to speed it up we are going to split the BAMs into 16 pieces each and run in parallel over all of them.
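
The split-and-merge pattern can be sketched as below. This is an illustration under stated assumptions: each shard is a plain list of `(barcode, umi, sequence)` tuples rather than a BAM file, the helper names are hypothetical, and threads are used to keep the example self-contained, whereas the notebook uses `ProcessPoolExecutor`.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def collect_from_shard(shard, bc_set):
    """Return {barcode: {(sequence, umi), ...}} for one shard of records."""
    found = defaultdict(set)
    for barcode, umi, seq in shard:
        if barcode in bc_set:
            found[barcode].add((seq, umi))
    return dict(found)


def collect_from_shards(shards, bc_set, max_workers=4):
    """Process shards concurrently and merge the per-shard results by barcode."""
    merged = defaultdict(set)
    with ThreadPoolExecutor(max_workers) as pool:
        worker = partial(collect_from_shard, bc_set=bc_set)
        for found in pool.map(worker, shards):
            for barcode, pairs in found.items():
                # set union also collapses duplicates that span shards
                merged[barcode].update(pairs)
    return dict(merged)


shards = [
    [("BC1", "U1", "ACGT"), ("BC2", "U2", "AC")],
    [("BC1", "U3", "ACGTAC"), ("BC1", "U1", "ACGT")],
]
merged = collect_from_shards(shards, {"BC1"})
```

Merging with set union means the result does not depend on how reads were distributed across the pieces.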

### Running skera and marti

In [None]:
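# NOTE: `sh_dir` (output directory for shell scripts) and `callao_bams` (the input BAM paths)
# are not defined in this notebook and must be set before running this cell
with open(sh_dir / f"{today}_skera.sh", "w") as out:
    for mcb in callao_bams: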
        base = mcb.name.split(".")[0]
        ix = mcb.name.rsplit(".", 2)[1]

        # echo the command so we see progress
        print(
            f"echo skera split -j 16 {mcb}",
            f"metadata/mas16_primers_{ix}.fasta",
            skera_path / f"{base}.skera.{ix}.bam",
            file=out
        )
        print(
            f"skera split -j 16 {mcb}",
            f"metadata/mas16_primers_{ix}.fasta",
            skera_path / f"{base}.skera.{ix}.bam",
            file=out
        )

In [None]:
# tagged bams
tagged_bams = sorted(cdna_path.glob("*.bam"))

len(tagged_bams)

### looking at CD14 monocyte stats

In [None]:
hto7_c4_numis = sr_numis[hto_fresh & (c_array == 4)]
hto8_c4_numis = sr_numis[~hto_fresh & (c_array == 4)]

hto7_c4_set = {bc for bc,c in zip(hto7_barcodes, c_array[hto_fresh]) if c == 4}
hto8_c4_set = {bc for bc,c in zip(hto8_barcodes, c_array[~hto_fresh]) if c == 4}


In [None]:
def rn_to_readumi_bcset(tagged_bam, bc_set):
    bc_reads = defaultdict(set)
    with pysam.AlignmentFile(tagged_bam, "rb", check_sq=False, threads=2) as fh:
        for a in fh:
            if (bc := sequence_to_int(a.get_tag("CB"))) in bc_set:
                bc_reads[bc].add((a.query, a.get_tag("UB")))

    return bc_reads

In [None]:
sample_reads = defaultdict(lambda: defaultdict(set))

with ProcessPoolExecutor(8) as exc:
    for rld in exc.map(rn_to_readumi_bcset, tagged_bams, itertools.repeat(hto7_c4_set)):
        for bc in rld:
            sample_reads["fresh_cd14"][bc].update(rld[bc])

    for rld in exc.map(rn_to_readumi_bcset, tagged_bams, itertools.repeat(hto8_c4_set)):
        for bc in rld:
            sample_reads["fixed_cd14"][bc].update(rld[bc])

In [None]:
sample_read_lens = defaultdict(dict)
for k in sample_reads:
    for bc in sample_reads[k]:
        sample_read_lens[k][bc] = [len(q) for q,_ in sample_reads[k][bc] if q is not None]

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(20, 6))

plot_dists(
    ax[0],
    [hto7_c4_numis, hto8_c4_numis],
    log=True,
    labels=["Fresh", "Fixed"],
    title="nUMIs per cell (Illumina)"
)
ax[0].set_yticks([3, 4], ["$10^3$", "$10^4$"])
ax[0].set_yticks(
    np.log10([v*10**i for i in range(3, 5) for v in range(2, 10)][:-4]),
    minor=True
)


plot_dists(
    ax[1],
    # explicit key order so the data always lines up with the labels
    [[len(v) for v in sample_reads[k].values()] for k in ("fresh_cd14", "fixed_cd14")],
    log=True,
    labels=["Fresh", "Fixed"],
    title="nUMIs per cell (PB)"
)
ax[1].set_yticks([2, 3, 4], ["$10^2$", "$10^3$", "$10^4$"])
ax[1].set_yticks(
    np.log10([v*10**i for i in range(1, 5) for v in range(2, 10)][1:-6]),
    minor=True
)


plot_dists(
    ax[2],
    [
        [v for vs in sample_read_lens[k].values() for v in vs]
        for k in ("fresh_cd14", "fixed_cd14")
    ],
    labels=["Fresh", "Fixed"],
    title="Read length (PB)"
)

ax[2].set_ylim(0, 800)

fig.suptitle("CD14+ cluster stats")
plt.savefig(figure_path / "fig1h_fixfresh_comparison.svg")
plt.show()