## Monomer processing

__Long-read monomer PBMC__ - run on two Sequel IIe flowcells, `m64455e_230922_195123` and `m64455e_230924_064453`
   * Raw data: [`gs://mdl-sc-isoform-2025-ms/sequencing_data/pbmc_monomer`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/sequencing_data/pbmc_monomer)
   * Contains six samples
     1. Fluent 0.8x SPRI
     2. Fluent 0.6x SPRI
     3. 10x 3' w/ purification
     4. 10x 5' w/ purification
     5. 10x 3' w/o purification
     6. 10x 5' w/o purification

### Processing

The first step is to tag the reads from each flowcell with `lima` to identify the samples, using the demux primers found in the `metadata` folder:

```bash
lima --no-clip m64455e_230922_195123.hifi_reads.bam mas_demux_primers.fasta m64455e_230922_195123.lima.bam
lima --no-clip m64455e_230924_064453.hifi_reads.bam mas_demux_primers.fasta m64455e_230924_064453.lima.bam
```

**Note:** `lima` has its own ideas about file naming and may add another `lima` to these names, so that they end with `lima.lima.bam`. Check the output folder before moving on.
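
If the doubled suffix does show up, a small helper can normalize the names before the next step. This is a minimal sketch and not part of the original workflow: the function name `fix_doubled_lima_suffix` is hypothetical, it only handles the `lima.lima.bam` case mentioned above, and it only touches BAM files (any companion files written by `lima` are left alone).

```python
from pathlib import Path


def fix_doubled_lima_suffix(folder: Path) -> list[Path]:
    """Rename any `*.lima.lima.bam` in `folder` to `*.lima.bam`.

    Hypothetical helper: returns the list of new paths.
    """
    renamed = []
    for bam in sorted(folder.glob("*.lima.lima.bam")):
        target = bam.with_name(bam.name.replace(".lima.lima.bam", ".lima.bam"))
        if target.exists():
            # never overwrite an existing file; leave it for manual inspection
            continue
        bam.rename(target)
        renamed.append(target)
    return renamed
```

Calling `fix_doubled_lima_suffix(Path("path/to/output"))` returns the renamed paths, so an empty list means nothing needed fixing.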

The next step is to run `callao` to split the monomers per index, including artifacts (A-A and Q-Q adapters):

```bash
callao --include-artifacts --input-bam path/to/m64455e_230922_195123.lima.bam --barcode-fasta mas_demux_primers.fasta --output-stem path/to/m64455e_230922_195123.callao.bam {1..6}
callao --include-artifacts --input-bam path/to/m64455e_230924_064453.lima.bam --barcode-fasta mas_demux_primers.fasta --output-stem path/to/m64455e_230924_064453.callao.bam {1..6}
```

From there, we need to run `marti` on the results. This step requires sample-specific configuration so it's easiest to set up via code.

The output from `marti` is available to download at:
 * [`gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/pbmc_marti_outputs.tgz`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/notebook_checkpoints/pbmc_marti_outputs.tgz)
 * [`gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/pbmc_marti_unk_outputs.tgz`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/notebook_checkpoints/pbmc_marti_unk_outputs.tgz).

In [None]:
from pathlib import Path
from collections import Counter

import yaml

from mdl.sc_isoform_paper import today
from mdl.sc_isoform_paper.constants import MONOMER_KEYS
from mdl.sc_isoform_paper.marti import CONFIG_DICT, SAMPLE_CONFIG, total_samples
from mdl.sc_isoform_paper.plots import plot_artifacts

In [None]:
root_dir = Path.home()
sh_dir = root_dir / "sh_scripts"

data_path = root_dir / "data" / "monomer"
figure_path = root_dir / "202501_figures"

# path to the marti binary
marti_bin = root_dir / "marti/build/bin/marti"

# path for marti output
marti_path = data_path / f"{today}_marti"
marti_path.mkdir(exist_ok=True)

# paths for input files, the output from callao
callao_bams = sorted((data_path / "callao").glob("*bam"))
callao_bams

# uncomment to download and extract into the data path
! # gcloud storage cp gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/pbmc_marti_outputs.tgz - | tar -C {marti_path} -xzv

### Marti Configuration

We used an error rate of 0.05 and a shorter polyA match, as these settings appeared to yield more reliable results. Because the Fluent PIPseq libraries have a very long barcode we increased the terminal search buffers to 150.

Based on the sample number (which maps to a chemistry type) we configure the adapters for the config file:

  1. Fluent 0.8x SPRI
  2. Fluent 0.6x SPRI
  3. 10x 3' w/ purification
  4. 10x 5' w/ purification
  5. 10x 3' w/o purification
  6. 10x 5' w/o purification
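
The sample number is read from the file name. The sketch below illustrates that lookup in isolation. It assumes the `callao` outputs are named `<movie>.callao.<index>.bam`, so that the index is the third dot-separated field; that naming is an assumption, and `SAMPLE_LABELS` and `sample_index` are hypothetical names used only for illustration.

```python
from pathlib import Path

# sample numbers and labels as listed above
SAMPLE_LABELS = {
    1: "Fluent 0.8x SPRI",
    2: "Fluent 0.6x SPRI",
    3: "10x 3' w/ purification",
    4: "10x 5' w/ purification",
    5: "10x 3' w/o purification",
    6: "10x 5' w/o purification",
}


def sample_index(bam: Path) -> int:
    """Return the integer in the third dot-separated field of the file name."""
    return int(bam.name.split(".")[2])


# ASSUMED file naming: <movie>.callao.<index>.bam
example = Path("m64455e_230922_195123.callao.1.bam")
label = SAMPLE_LABELS[sample_index(example)]
```

If the files follow a different pattern, `int(...)` raises a `ValueError`, which is a useful early signal that the field position needs adjusting.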


In [None]:
with open(sh_dir / f"{today}_monomer_marti.sh", "w") as out:
    for cb in callao_bams:
        i = int(cb.name.split(".")[2])
        mp = marti_path / cb.stem
    
        # make a run directory for each file
        mp.mkdir(exist_ok=True, parents=True)
        config_file = mp / "config.yaml"

        # write config file with appropriate parameters
        with open(config_file, "w") as out2:
            print(
                yaml.dump(
                    {"input_bam": str(cb)}
                    | SAMPLE_CONFIG[MONOMER_KEYS[i][0]]
                    | CONFIG_DICT,
                    sort_keys=False
                ),
                file=out2
            )
        print(f"{marti_bin} {config_file}", file=out)


### Artifact comparison

After we run `marti` we can accumulate the results from the `structure_counts.tsv` files and plot them.
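
For readers without the `mdl.sc_isoform_paper` package, the accumulation step can be sketched as follows. This is not the implementation of `total_samples`. It assumes each `structure_counts.tsv` is a two-column, tab-separated file of classification label and count, which is an assumption about the file layout, and `read_structure_counts` and `accumulate` are hypothetical names.

```python
import csv
from collections import Counter
from pathlib import Path


def read_structure_counts(tsv: Path) -> Counter:
    """Sum counts per label from a two-column TSV (ASSUMED layout: label, count)."""
    counts = Counter()
    with open(tsv, newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if len(row) < 2 or not row[1].isdigit():
                continue  # skip blank lines and any header row
            counts[row[0]] += int(row[1])
    return counts


def accumulate(tsvs) -> Counter:
    """Add up the per-file counters, e.g. one per flowcell for the same sample."""
    return sum((read_structure_counts(t) for t in tsvs), start=Counter())
```

Summing with `start=Counter()` is the same pattern the notebook uses to combine per-sample totals.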

In [None]:
# the ordering of these indices is to group the output nicely
full_sample_order = [MONOMER_KEYS[i] for i in [5, 6, 3, 4, 1, 2]]

# for brevity, we plot only three samples: unpurified 10x 3' and 5', and the 0.8x SPRI PIPseq
sample_order = [MONOMER_KEYS[i] for i in [5, 6, 1]]

In [None]:
sample_totals = total_samples(marti_path, MONOMER_KEYS)

# artifact types to plot: anything over 1% in at least one of the samples
overall_total = sum((sample_totals[s] for s in sample_order), start=Counter())
# the trailing [1:] drops the most common classification (`Proper`), which is not plotted
key_list = [k for k,_ in overall_total.most_common() if any(sample_totals[s][k] / sample_totals[s].total() > 0.01 for s in sample_order)][1:]

plot_artifacts(
    sample_order, sample_totals, key_list, title="PBMC monomer artifacts",
)

The plot shows the percentage of each artifact type observed in this dataset (we are not plotting `Proper` reads, which are the majority in all cases). 10x 3' clearly has a large proportion of TSO-TSO artifacts, while 10x 5' has a few, along with a couple of other types. The PIPseq data has a significant number of `OnlyPolyA` reads, which are really due to RNA degradation in the monocyte cells.
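
The selection of artifact types for the plot (anything above 1% in at least one sample, minus the most common label) can be illustrated on toy counts. The numbers, the sample names, and the labels `TsoTso` and `Rare` are made up for this sketch, and `keys_over_threshold` is a hypothetical helper.

```python
from collections import Counter

# toy data, not real results
toy_totals = {
    "sample_a": Counter({"Proper": 900, "TsoTso": 80, "OnlyPolyA": 15, "Rare": 5}),
    "sample_b": Counter({"Proper": 950, "TsoTso": 5, "OnlyPolyA": 40, "Rare": 5}),
}


def keys_over_threshold(totals, samples, threshold=0.01):
    """Labels above `threshold` in any sample, ordered by overall count."""
    overall = sum((totals[s] for s in samples), start=Counter())
    keep = [
        k
        for k, _ in overall.most_common()
        if any(totals[s][k] / sum(totals[s].values()) > threshold for s in samples)
    ]
    return keep[1:]  # drop the most common label (`Proper` in this toy data)


keys = keys_over_threshold(toy_totals, ["sample_a", "sample_b"])
# keys == ["TsoTso", "OnlyPolyA"]; "Rare" is 0.5% in both samples and is dropped
```

Note that dropping the first element assumes the most common label overall is the one to exclude; filtering on the label name would be more explicit.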

One point of interest here is the number of Unknown reads. When we inspected the structures responsible for those reads, we saw that they often appear to be TSO-TSO reads that were classified as `Unk` because they have additional `polyA` or `polyT` annotations. This is likely due to the short (10bp) threshold we used for the initial classification. So, we did an additional round of `marti` classification on just the Unknown reads, using default settings for the polyA.
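
The second round reuses the first-round configuration and overrides only the polyA setting. With Python's dict union operator `|`, the right-hand operand wins on duplicate keys, so the override has to come last. A minimal sketch: `base_config` is a hypothetical stand-in for the real configuration, `some_other_setting` is a made-up key, and the first-round value of 10 under `min_polyA_match` is assumed from the 10bp threshold mentioned above.

```python
# hypothetical stand-in for the first-round configuration
base_config = {"min_polyA_match": 10, "some_other_setting": "unchanged"}

# second-round override, using the default polyA length
override = {"min_polyA_match": 20}

# with `|`, later operands take precedence on duplicate keys
merged = {"input_bam": "reads.unknown.bam"} | base_config | override
# merged["min_polyA_match"] is now 20; the other keys are untouched
```

Key order follows first appearance, so `input_bam` stays at the top of the merged dictionary.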

In [None]:
marti_bams = sorted(marti_path.glob("*/*classified.bam"))
marti_path_unk = data_path / f"{today}_marti_unk"
marti_path_unk.mkdir(exist_ok=True)

default_polya_len = { "min_polyA_match": 20 }

# uncomment to download and extract to marti_path_unk
! # gcloud storage cp gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/pbmc_marti_unk_outputs.tgz - | tar -C {marti_path_unk} -xzv

In [None]:
with open(sh_dir / f"{today}_monomer_marti_unk.sh", "w") as out:
    for mb in marti_bams:
        i = int(mb.name.split(".")[2])
        mp = marti_path_unk / mb.stem
        mp.mkdir(exist_ok=True, parents=True)
        config_file = mp / "config.yaml"

        # use samtools to extract only the reads tagged as unknown (lb:Unk)
        unk_mb = marti_path_unk / mb.with_suffix(".unknown.bam").name
        print(f"samtools view -b -h --tag lb:Unk {mb} > {unk_mb}", file=out)

        # write config file with appropriate parameters
        with open(config_file, "w") as out2:
            print(
                yaml.dump(
                    {"input_bam": str(unk_mb)} 
                    | SAMPLE_CONFIG[MONOMER_KEYS[i][0]]
                    | CONFIG_DICT
                    | default_polya_len,
                    sort_keys=False
                ),
                file=out2
            )
        print(f"{marti_bin} {config_file}", file=out)


After running `marti` again we read in the reclassified results and add them to the previous ones, being careful to remove the `Unk` counts from before.
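
The bookkeeping can be checked on toy numbers. This sketch uses made-up counts and the made-up label `TsoTso`; the point is the invariant that the merged total equals the first-pass total, which holds only when the second pass accounts for every read that was `Unk` in the first pass.

```python
from collections import Counter

# toy data, not real results
first_pass = Counter({"Proper": 900, "TsoTso": 60, "Unk": 40})
# reclassification of the 40 Unk reads: 30 resolved, 10 still unknown
second_pass = Counter({"TsoTso": 30, "Unk": 10})

merged = first_pass.copy()
merged.pop("Unk")       # remove the old Unk count so those reads are not counted twice
merged += second_pass   # add the reclassified reads back in

# every read is counted exactly once
assert sum(merged.values()) == sum(first_pass.values())
```

If the assertion fails, some `Unk` reads were lost or duplicated between the two rounds, for example because the `samtools` filtering step did not select the same reads.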

In [None]:
unk_sample_totals = total_samples(marti_path_unk, MONOMER_KEYS)
reclassified_sample_totals = dict()
for s in sample_totals:
    reclassified_sample_totals[s] = sample_totals[s].copy()
    reclassified_sample_totals[s].pop("Unk")
    reclassified_sample_totals[s] += unk_sample_totals[s]
    assert reclassified_sample_totals[s].total() == sample_totals[s].total()

In [None]:
# this code prints out the counts for Supplementary Table 2
overall_total = sum(reclassified_sample_totals.values(), start=Counter())
keys2 = [k for k,_ in overall_total.most_common()]
max_len = max(len(k) for k in keys2)

print(f"{'':{max_len}}", *(f"{s:>10}" for s,_ in full_sample_order), sep="\t")
print(f"{'':{max_len}}", *(f"{s:>10}" for _,s in full_sample_order), sep="\t")
print(f"{'total':{max_len}}", *(f"{reclassified_sample_totals[s].total():10,}" for s in full_sample_order), sep="\t")
print(f"{'read classification':{max_len}}")
for k in keys2:
    print(f"{k:{max_len}}", *(f"{reclassified_sample_totals[s][k]:10,}" for s in full_sample_order), sep="\t")

In [None]:
# artifact types to plot: anything over 1% in at least one of the samples
# the trailing [1:] drops the most common classification (`Proper`), which is not plotted
key_list = [k for k,_ in overall_total.most_common() if any(reclassified_sample_totals[s][k] / reclassified_sample_totals[s].total() > 0.01 for s in sample_order)][1:]

plot_artifacts(
    sample_order, reclassified_sample_totals, key_list, title="PBMC monomer artifacts",
    output_file=figure_path / "fig1d_artifact_rates.svg"
)

After merging in the reclassification, we see that most of the Unknown reads are now assigned to known structures.