## MAS-seq processing

__MAS-seq PBMC__ - run on eight Revio flowcells
   * Raw data: [`gs://mdl-sc-isoform-2025-ms/sequencing_data/pbmc_masseq`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/sequencing_data/pbmc_masseq)
   * Contains four samples
     * Fluent PIPseq 0.8x SPRI
     * Fluent PIPseq 0.6x SPRI
     * 10x 3'
     * 10x 5'

The second set of SMRTcells for the MAS-seq PBMC samples was run at different concentrations as an optimization experiment. There was no clear relationship between concentration and output, so the data were incorporated normally.

| flowcell | concentration | # reads |
|-|-|-|
| `m84143_230929_195310_s1` | 350 pM | 3,122,400 |
| `m84143_230929_202333_s2` | 400 pM | 4,098,164 |
| `m84143_230929_205439_s3` | 450 pM | 3,361,489 |
| `m84143_230929_212545_s4` | 500 pM | 3,036,894 |

The raw data in the `pbmc_masseq` directory had been needlessly split by PB barcode--at the time of sequencing this was automatically performed by the Revio instrument. The files were re-unified with `lima-undo`.

### Processing

The zero-th step is to unify the `default` and `unassigned` BAMs with `lima-undo`:

```bash
mkdir pbmc_masseq/undo
lima-undo -j 16 \
    pbmc_masseq/m84063_230829_195231_s1.hifi_reads.default.bam \
    pbmc_masseq/m84063_230829_195231_s1.hifi_reads.unassigned.bam \
    pbmc_masseq/undo/m84063_230829_195231_s1.undo.bam
...
```
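
The `...` above stands for the same command repeated for each flowcell. As a minimal sketch, the commands could be generated from a list of flowcell names. The helper `lima_undo_command` is hypothetical (it is not part of the repository), and only the one flowcell name shown above is filled in:

```python
import shlex
from pathlib import Path


def lima_undo_command(flowcell, data_dir=Path("pbmc_masseq"), threads=16):
    """Build the lima-undo command for one flowcell, mirroring the example above."""
    args = [
        "lima-undo", "-j", str(threads),
        str(data_dir / f"{flowcell}.hifi_reads.default.bam"),
        str(data_dir / f"{flowcell}.hifi_reads.unassigned.bam"),
        str(data_dir / "undo" / f"{flowcell}.undo.bam"),
    ]
    # shlex.join quotes any argument that needs it for the shell
    return shlex.join(args)


# only the flowcell from the example is listed; extend with the remaining names
flowcells = ["m84063_230829_195231_s1"]
for fc in flowcells:
    print(lima_undo_command(fc))
```

The printed lines can be redirected into a shell script, in the same way the `skera` script is written later in this notebook.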

The first step is to tag the flowcells with `lima` to identify the samples, using the demux primers found in the `resources` folder:

```bash
mkdir pbmc_masseq/lima
lima -j 16 --no-clip \
    pbmc_masseq/undo/m84063_230829_195231_s1.undo.bam \
    metadata/mas_demux_primers.fasta \
    pbmc_masseq/lima/m84063_230829_195231_s1.lima.bam
...
```

**Note** `lima` has some quirks with file naming: it might add another `lima` to the output names so that they end with `lima.lima.bam`. Check the output folder before running the next step.
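
If the doubled suffix does show up, one option is to rename the files so the later steps find the expected names. This is a minimal sketch, not part of the original workflow: `fix_doubled_lima_suffix` is a hypothetical helper, and it only touches the BAM files (any companion files written next to them would need the same treatment).

```python
from pathlib import Path


def fix_doubled_lima_suffix(lima_dir):
    """Rename `*.lima.lima.bam` files to `*.lima.bam`; return the new paths."""
    renamed = []
    for bam in sorted(Path(lima_dir).glob("*.lima.lima.bam")):
        target = bam.with_name(bam.name.replace(".lima.lima.bam", ".lima.bam"))
        if target.exists():
            # never overwrite a file that already has the expected name
            continue
        bam.rename(target)
        renamed.append(target)
    return renamed
```

Calling `fix_doubled_lima_suffix("pbmc_masseq/lima")` returns the list of renamed files, which is empty when nothing needed fixing.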

The next step is to run `callao` to split the samples per-index, including artifacts (A-A and Q-Q adapters):

```bash
mkdir pbmc_masseq/callao
callao --include-artifacts \
    --barcode-fasta metadata/mas_demux_primers.fasta \
    --input-bam pbmc_masseq/lima/m84063_230829_195231_s1.lima.bam \
    --output-stem pbmc_masseq/callao/m84063_230829_195231_s1.callao.bam \
    {1..4}
...
```

Next, we run `skera` on each of the samples. This requires an index-specific adapter file, because the `A` and `Q` adapters differ depending on the index that was used. The necessary files are in the `metadata.tgz` file. We will create a bash script for running `skera` in this notebook.
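
The mapping from a `callao` output to its adapter file and `skera` output can be sketched as below. This assumes the `callao` BAMs are named `<flowcell>.callao.<index>.bam`, which is what the parsing in this notebook implies but is not confirmed here; `skera_io` is a hypothetical helper used only for illustration.

```python
from pathlib import Path


def skera_io(callao_bam, out_dir):
    """Return (adapter fasta, skera output path) for one callao BAM."""
    name = Path(callao_bam).name
    base = name.split(".")[0]      # flowcell identifier, before the first "."
    ix = name.rsplit(".", 2)[1]    # index, the field just before ".bam"
    adapters = f"metadata/mas16_primers_{ix}.fasta"
    return adapters, Path(out_dir) / f"{base}.skera.{ix}.bam"


adapters, out_bam = skera_io(
    "callao/m84063_230829_195231_s1.callao.3.bam", "skera"
)
print(adapters, out_bam)
```

Keeping the index as the second-to-last field of the output name matters later, because the `marti` step reads it back from the same position.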

From there, we need to run `marti` on the results. This step requires sample-specific configuration so it's easiest to set up via code.

### imports

In [None]:
import pickle
from collections import defaultdict
from pathlib import Path

import numpy as np

import matplotlib.pyplot as plt

import yaml

from mdl.sc_isoform_paper import today
from mdl.sc_isoform_paper.constants import MASSEQ_KEYS
from mdl.sc_isoform_paper.marti import CONFIG_DICT, SAMPLE_CONFIG
from mdl.sc_isoform_paper.skera import read_length_csv, read_ligation_csv
from mdl.sc_isoform_paper.plots import plot_concat_and_ligations

In [None]:
root_dir = Path.home()
sh_dir = root_dir / "sh_scripts"

data_path = root_dir / "data" / "masseq"
figure_path = root_dir / "202501_figures"

# path for output of skera
skera_path = data_path / f"{today}_skera"
skera_path.mkdir(exist_ok=True)

# path to the marti binary
marti_bin = root_dir / "marti/build/bin/marti"

# path for marti output
marti_path = data_path / f"{today}_marti"
marti_path.mkdir(exist_ok=True)

# paths for input files, the output from callao
callao_bams = sorted((data_path / "callao").glob("*bam"))
callao_bams

# uncomment to download and extract into the data path
! # gcloud storage cp gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/3dg_marti_outputs.tgz - | tar -C {marti_path} -xzv

### deconcat the multimer libraries and run marti

In [None]:
with open(sh_dir / f"{today}_skera.sh", "w") as out:
    for mcb in callao_bams:
        base = mcb.name.split(".")[0]
        ix = mcb.name.rsplit(".", 2)[1]

        # echo the command so we see progress
        print(
            f"echo skera split -j 16 {mcb}",
            f"metadata/mas16_primers_{ix}.fasta",
            skera_path / f"{base}.skera.{ix}.bam",
            file=out
        )
        print(
            f"skera split -j 16 {mcb}",
            f"metadata/mas16_primers_{ix}.fasta",
            skera_path / f"{base}.skera.{ix}.bam",
            file=out
        )

## MAS-seq QC

After running `skera` we make some basic QC plots based on the output. To regenerate the figures without processing these files, download and load the `pbmc_skera_stats.pickle` file from `gs://mdl-sc-isoform-2025-ms/notebook_checkpoints`, and skip to the next section.

In [None]:
sample_names = [MASSEQ_KEYS[i] for i in (1, 2, 3, 4)]

skera_stats_file = data_path / "pbmc_skera_stats.pickle"

# defining bins for the length histograms
bins = np.linspace(0, 2500, 126)

# uncomment to download
! # gcloud storage cp gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/pbmc_skera_stats.pickle.tgz {data_path}/

In [None]:
if skera_stats_file.exists():
    with open(skera_stats_file, "rb") as fh:
        counts, sread_hists, sread_percentiles, read_lengths = pickle.load(fh)
else:
    skera_ligations = sorted(skera_path.glob("*ligations.csv"))
    skera_read_len = sorted(skera_path.glob("*read_lengths.csv"))

    counts = dict()
    sread_hists = dict()
    sread_percentiles = dict()

    read_lengths = defaultdict(list) 
    sread_lengths = defaultdict(list)

In [None]:
if len(counts) == 0:
    for i in MASSEQ_KEYS:
        counts[MASSEQ_KEYS[i]] = np.dstack(
            [read_ligation_csv(skl, 16) for skl in skera_ligations
             if int(skl.name.split(".")[2]) == i]
        ).sum(2)

In [None]:
if len(read_lengths) == 0:
    read_lengths = defaultdict(list)
    sread_lengths = defaultdict(list)
    for srl in skera_read_len:
        i = int(srl.name.split(".")[2])
        read_lens, sread_lens = read_length_csv(srl)
        read_lengths[MASSEQ_KEYS[i]].extend(read_lens)
        sread_lengths[MASSEQ_KEYS[i]].extend(sread_lens)

    read_lengths = {k: np.array(read_lengths[k]) for k in sample_names}
    sread_lengths = {k: np.array(sread_lengths[k]) for k in sample_names}

    for k in sample_names:
        sread_percentiles[k] = np.percentile(sread_lengths[k][:, 0], (5, 50, 95))
        sread_hists[k] = np.histogram(sread_lengths[k][:, 0], bins=bins)[0]

In [None]:
# save the stats file
if not skera_stats_file.exists():
    with open(skera_stats_file, "wb") as out:
        pickle.dump(
            (counts, sread_hists, sread_percentiles, read_lengths), out
        )

### Looking at read length and s-read length distributions

We'll print out some statistics on the array deconcatenation. Always worth checking that the arrays were constructed properly.
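
The cells below index `read_lengths[k]` by column. As a minimal sketch of the assumed layout (one row per array, column 0 the array length in bases, column 1 the number of s-reads in the array), here are the same statistics on a small synthetic array. The layout is inferred from how the notebook uses these arrays, and the numbers are made up.

```python
import numpy as np

# synthetic stand-in for read_lengths[k]; values are invented for illustration
arrays = np.array([
    [15000, 16],
    [14200, 16],
    [9100, 10],
    [15800, 16],
])

# median and 90% interval of the array length (column 0)
p5, p50, p95 = np.percentile(arrays[:, 0], (5, 50, 95))

# proportion of arrays that are full 16-mers (column 1)
frac_16mer = (arrays[:, 1] == 16).mean()

print(f"{p50:.0f} ({p5:.0f}-{p95:.0f})", f"{frac_16mer:.2%}", sep="\t")
```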

In [None]:
# median and 90% interval for length of arrays
for k in sample_names:
    print(
        f"{' '.join(k):12s}",
        f"{np.percentile(read_lengths[k][:, 0], (5, 50, 95))}",
        sep="\t"
    )

In [None]:
# proportion of arrays that are 16-mer
for k in sample_names:
    print(
        f"{' '.join(k):12s}",
        f"{len(read_lengths[k]):,d}", 
        f"{(read_lengths[k][:, 1] == 16).sum() / len(read_lengths[k]):.2%}",
        sep="\t"
    )

In [None]:
# median and 90% interval for s-read length
for k in sample_names:
    print(
        f"{' '.join(k):12s}",
        f"{sread_percentiles[k]}",
        sep="\t"
    )

In [None]:

fig, ax = plt.subplots(1, 1, figsize=(18, 5))

ax.hist(
    [bins[:-1] for _ in sample_names],
    weights=[sread_hists[k] for k in sample_names],
    label=[" ".join(k) for k in sample_names],
    histtype="step", alpha=0.5, bins=bins, density=True, linewidth=2
)

ax.legend()
plt.title("Read length distribution for MASseq data")
plt.savefig(figure_path / "supp_fig4_read_len_distribution.svg")
plt.show()


Here we look at the overall read length distribution for s-reads from these experiments. The mRNA degradation in the PIPseq data is clearly visible as a large density of shorter reads.

It's difficult to see much more detail in terms of the different technologies--they all have a similar distribution for the longer reads.
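
The plotting cell above draws from the stored `sread_hists` counts rather than the raw lengths: it passes the left bin edges as the data and the counts as `weights`. Since `ax.hist` does its binning through `np.histogram`, the idea can be checked with NumPy alone. The sketch below uses synthetic lengths and the same bin edges as the notebook.

```python
import numpy as np

bins = np.linspace(0, 2500, 126)    # 126 edges, i.e. 125 bins of width 20
rng = np.random.default_rng(0)
lengths = rng.integers(0, 2500, size=1000)    # synthetic s-read lengths

# this is the form that gets stored in sread_hists
stored = np.histogram(lengths, bins=bins)[0]

# each left edge falls in its own bin, so weighting by the stored counts
# reproduces the original histogram exactly
replayed = np.histogram(bins[:-1], bins=bins, weights=stored)[0]
assert np.array_equal(replayed, stored)

# density=True rescales so that the area under the histogram is 1
density = np.histogram(bins[:-1], bins=bins, weights=stored, density=True)[0]
assert np.isclose((density * np.diff(bins)).sum(), 1.0)
```

Because each sample is normalized separately, the curves compare shapes, not read counts.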

### Ligation histograms and heatmaps

Here we make the standard QC plots for MAS-seq: the distribution of array length (separated by array size) and the ligation heatmaps.
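
The `counts[k]` matrices used for the heatmaps were built earlier by stacking one ligation matrix per flowcell with `np.dstack` and summing over the stacking axis. A minimal sketch with two synthetic 2x2 matrices (the real matrices come from `read_ligation_csv`, whose output is not shown here):

```python
import numpy as np

# invented per-flowcell matrices, standing in for read_ligation_csv output
per_flowcell = [
    np.array([[5, 0], [1, 7]]),
    np.array([[3, 1], [0, 2]]),
]

stacked = np.dstack(per_flowcell)   # shape (2, 2, number of flowcells)
total = stacked.sum(2)              # element-wise sum across flowcells
print(total)
```

One thing to watch for: if no ligation file matches a given index, the list passed to `np.dstack` is empty and the call should fail instead of returning an empty matrix.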

In [None]:
for k in sample_names:
    plot_concat_and_ligations(read_lengths[k], counts[k], True)
    plt.suptitle(" ".join(k))
    plt.savefig(figure_path / f"supplementary_fig5_{'-'.join(k)}.svg")
    plt.show()

These plots are the standard QC plots for MAS-seq experiments, showing that all four of these runs are primarily 16-mer arrays and there are essentially no arrays with incorrectly-paired adapters.

There are slightly more short arrays in the PIPseq data than in the 10x samples. This is likely due to the fact that no artifact purification was done on the PIPseq libraries--the artifact rate for PIPseq is low enough that purification is not necessary, but the array-disrupting artifacts are not entirely absent.

## Annotating reads with marti

After running `skera`, we need to run `marti`, similar to how we processed the monomer data.
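
Each run's config is assembled by merging three dictionaries with the `|` operator (Python 3.9+). When a key appears more than once, the right-most dictionary wins, so anything in `CONFIG_DICT` overrides the sample-specific values. A minimal sketch with hypothetical keys (these are not real `marti` parameters):

```python
# hypothetical stand-ins for SAMPLE_CONFIG[...] and CONFIG_DICT
sample_config = {"adapter_a": "AAGC", "threads": 4}
shared_config = {"threads": 16, "shared_option": True}

config = {"input_bam": "example.skera.1.bam"} | sample_config | shared_config

# keys keep first-seen order, and later dictionaries override earlier values
print(list(config))
print(config["threads"])
```

Because the merged dictionary keeps insertion order, writing it with `sort_keys=False` leaves `input_bam` as the first entry in the YAML file.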

In [None]:
# skera outputs are named <flowcell>.skera.<index>.bam
skera_bams = sorted(skera_path.glob("*.skera.*.bam"))
skera_bams

In [None]:
with open(sh_dir / f"{today}_marti.sh", "w") as out:
    for sb in skera_bams:
        i = int(sb.name.rsplit(".", 2)[1])
        mp = marti_path / sb.stem

        # make a run directory for each file
        mp.mkdir(exist_ok=True)
        config_file = mp / "config.yaml"

        # write config file with appropriate parameters
        with open(config_file, "w") as out2:
            print(
                yaml.dump(
                    {"input_bam": str(sb)}
                    | SAMPLE_CONFIG[MASSEQ_KEYS[i][0]]
                    | CONFIG_DICT,
                    sort_keys=False
                ),
                file=out2
            )

        print(f"{marti_bin} {config_file}", file=out)


After `marti` has completed we can move on to barcode matching/tagging and mapping with `minimap2`.