## Analysis of 3dG vs standard 10x TSO

This experiment is part of figure 3, as a solution to the high rate of TSO artifacts in 10x 3' data. We have two experiments: a standard 10x 3' library run on PBMCs (not the same batch as other experiments), and a modified 10x 3' run with a custom 3dG TSO to prevent artifact formation.

__Long-read monomer PBMC__ - run on two SMRTcell lanes, `m84174_240308_153036_s3` (control TSO) and `m84174_240308_173005_s4` (3dG TSO)
   * Raw data: [`gs://mdl-sc-isoform-2025-ms/sequencing_data/3dg_monomer`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/sequencing_data/3dg_monomer)
   
To download: `gcloud storage rsync -r gs://mdl-sc-isoform-2025-ms/sequencing_data/3dg_monomer path/for/data`

Total size is 2.7G.

__Short-read data__ - The short-read libraries were sequenced on a NovaSeq S4, then demultiplexed and processed in the same manner as the earlier short-read data.
  * `fastq.gz` files: [`gs://mdl-sc-isoform-2025-ms/sequencing_data/3dg_illumina`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/sequencing_data/3dg_illumina)
  * Output from CellRanger: [`gs://mdl-sc-isoform-2025-ms/control-tso_10x_3p`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/control-tso_10x_3p) and [`gs://mdl-sc-isoform-2025-ms/3dg-tso_10x_3p`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/3dg-tso_10x_3p)

### Short-read processing

To process the short-read data we ran CellRanger as before. The 3dG TSO does not affect the structure of the sequences, so the standard software works without modification. We consolidate the summary data from the CellRanger output for Supplementary Table 3.
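
The consolidation step is not shown in this notebook. As a minimal sketch, assuming each CellRanger run provides a one-row metrics CSV (a header line and one line of values, such as `outs/metrics_summary.csv`), the per-sample rows could be combined as below. The helper names `read_metrics` and `consolidate`, the two column names, and all values are placeholders for illustration, not results from these libraries.

```python
import csv
import io


def read_metrics(handle):
    """Return the single data row of a one-row metrics CSV as a dict."""
    rows = list(csv.DictReader(handle))
    if len(rows) != 1:
        raise ValueError(f"expected one data row, found {len(rows)}")
    return rows[0]


def consolidate(named_handles):
    """Combine per-sample metric rows into one table (a list of dicts)."""
    return [
        {"sample": name} | read_metrics(handle)
        for name, handle in named_handles.items()
    ]


# Placeholder file contents (hypothetical columns and values, NOT real results).
# In practice these would be open file handles for each sample's metrics CSV.
demo = {
    "MDL_ControlTSO": io.StringIO(
        'Estimated Number of Cells,Mean Reads per Cell\n"1,000","20,000"\n'
    ),
    "MDL_DevTSO": io.StringIO(
        'Estimated Number of Cells,Mean Reads per Cell\n"1,200","18,000"\n'
    ),
}

table = consolidate(demo)
for row in table:
    print(row)
```

Values are kept as strings here because the quoted fields may contain thousands separators; converting them to numbers is left to whatever builds the final table.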

### Long-read processing

The analysis of the long-read data is relatively simple. No indexing sequences are used, so we only need to run `marti` to classify artifacts.

The `structure_counts.tsv` files from running `marti` are at [`gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/3dg_marti_output.tgz`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/notebook_checkpoints/3dg_marti_output.tgz).

For the MAS-seq (Kinnex) data, we run `skera` to show the effect of TSO cleaning on the standard 10x 3' data.

In [None]:
import pickle
from pathlib import Path

import numpy as np
import yaml

import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

from mdl.sc_isoform_paper import today
from mdl.sc_isoform_paper.marti import CONFIG_DICT, SAMPLE_CONFIG, read_sample_reports
from mdl.sc_isoform_paper.skera import read_length_csv, read_ligation_csv
from mdl.sc_isoform_paper.plots import plot_concat_and_ligations, plot_stacked_concat

### Setup

In [None]:
root_dir = Path.home()
sh_dir = root_dir / "sh_scripts"

data_path = root_dir / "data"
figure_path = root_dir / "202501_figures"

# path to the short-read data (output from cellranger + pipseeker)
sr_data_paths = [
    data_path / "3dg_shortread" / "MDL_DevTSO",
    data_path / "3dg_shortread" / "MDL_ControlTSO"
]

lr_data_path = data_path / "3dg_monomer"
mas_data_path = data_path / "3dg_masseq"

In [None]:
# path for output of skera
skera_path = mas_data_path / f"{today}_skera"
skera_path.mkdir(exist_ok=True)

# path to the marti binary
marti_bin = root_dir / "marti/build/bin/marti"

# path for marti output
marti_path = lr_data_path / f"{today}_marti"
marti_path.mkdir(exist_ok=True)

# uncomment to download and extract into the data path
# ! gcloud storage cp gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/3dg_marti_outputs.tgz - | tar -C {marti_path} -xzv

### Running marti

These are 10x 3' libraries, so we can use the same `marti` adapter config as before. Based on the results from notebook 03, we use the default minimum poly(A) match length of 20 (`min_polyA_match`), as otherwise some TSO-TSO artifacts are misclassified.

If you have downloaded the outputs, you can skip to **Reading marti results**.

In [None]:
lr_bams = sorted((lr_data_path / "raw").glob("*hifi_reads.bam"))
lr_bams

In [None]:
lr_sample_names = {
    "m84174_240308_153036_s3.hifi_reads": "ControlTSO",
    "m84174_240308_173005_s4.hifi_reads": "DevTSO",
}

In [None]:
default_polya_len = { "min_polyA_match": 20 }

with open(sh_dir / f"{today}_3dg_marti.sh", "w") as out:
    for bam_file in lr_bams:
        mp = marti_path / bam_file.stem

        # make a run directory for each file
        mp.mkdir(exist_ok=True)
        config_file = mp / "config.yaml"

        # write config file with appropriate parameters
        with open(config_file, "w") as out2:
            print(
                 yaml.dump(
                    {"input_bam": str(bam_file)}
                     | SAMPLE_CONFIG["10x 3'"]
                     | CONFIG_DICT
                     | default_polya_len,
                    sort_keys=False
                ),
                file=out2
            )

        print(f"{marti_bin} {config_file}", file=out)
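
The config above is built with the dict union operator `|` (Python 3.9+), where the right-most operand wins when keys collide. That is why `default_polya_len` is placed last: it would override any `min_polyA_match` set earlier in the chain. A minimal sketch with hypothetical stand-ins for `SAMPLE_CONFIG["10x 3'"]` and `CONFIG_DICT` (their real contents are not shown here, and `hypothetical_adapters` and `hypothetical_option` are invented keys):

```python
# Hypothetical stand-ins; the real dictionaries come from mdl.sc_isoform_paper.marti
sample_config = {"hypothetical_adapters": "placeholder"}
shared_config = {"min_polyA_match": 15, "hypothetical_option": True}
override = {"min_polyA_match": 20}

# Later operands take precedence, so the override wins for min_polyA_match
merged = (
    {"input_bam": "example.hifi_reads.bam"}
    | sample_config
    | shared_config
    | override
)

print(merged["min_polyA_match"])
print(list(merged))
```

A key that is overridden keeps the position where it first appeared, so `input_bam` stays first in the dumped YAML when `sort_keys=False` is used.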


### Reading marti results

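We compare the fraction of reads that `marti` classifies as TSO-TSO artifacts (`TsoTso`) in each library, to show that the artifact is essentially gone with the 3dG TSO.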

In [None]:
sample_totals = {
    lr_sample_names[k]: v for k,v in read_sample_reports(marti_path).items()
}

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(3, 5))

name_list = ["ControlTSO", "DevTSO"]

x = np.arange(len(name_list))
vs = np.array([sample_totals[n]['TsoTso'] for n in name_list])
tots = np.array([sample_totals[n].total() for n in name_list])

ax.bar(x, vs / tots, width=0.6)
ax.set_xticks(x, name_list, fontsize="medium", rotation=90)
for j, (v, tot) in enumerate(zip(vs, tots)):
    ax.annotate(
        f"{v:,d}\n({(v / tot):.2%})",
        (j, v / tot),
        xytext=(0, 5),
        textcoords="offset points",
        ha="center",
        fontsize="small",
    )
ax.margins(x=0.3, y=0.2)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1, decimals=0))
plt.savefig(figure_path / "fig1f_3dg_tso.svg")
plt.show()
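
The quantity plotted above is the share of reads in the `TsoTso` class for each sample. A small sketch of that calculation, assuming each entry of `sample_totals` behaves like a `collections.Counter` keyed by class name (inferred from the `.total()` call). The counts and the `Other` class below are made up for illustration and are not results:

```python
from collections import Counter

# Made-up counts for illustration only (NOT real results)
demo_totals = {
    "ControlTSO": Counter({"Other": 900, "TsoTso": 100}),
    "DevTSO": Counter({"Other": 995, "TsoTso": 5}),
}


def artifact_fraction(counts, key="TsoTso"):
    """Fraction of all classified reads that fall in the given class."""
    total = sum(counts.values())  # same as counts.total() on Python 3.10+
    return counts[key] / total if total else 0.0


fractions = {name: artifact_fraction(c) for name, c in demo_totals.items()}
for name, frac in fractions.items():
    print(f"{name}\t{frac:.2%}")
```

Using `sum(counts.values())` instead of `Counter.total()` keeps the sketch usable on Python versions before 3.10.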

## 3dG MAS-seq processing

This experiment is fairly simple: we had three MAS-seq libraries made from 10x 3' kits.

 * `3dGNoKBClean` - The 10x capture used a 3dG TSO to prevent artifact formation. No purification was done.
 * `ControlNoKBClean` - Standard 10x 3' kit, arrays were made without any purification.
 * `ControlWithKBClean` - Standard 10x 3' kit with a purification step to remove TSO artifacts.

Here we show that the purification step is necessary for array formation when using the standard TSO, whereas the 3dG TSO prevents artifacts from forming, so arrays can be created without purification.

#### Raw data

Raw data is in `gs://mdl-sc-isoform-2025-ms/sequencing_data/3dg_masseq`

To download:
```
mkdir -p data/3dg_masseq/raw/{3dGNoKBClean,ControlNoKBClean,ControlWithKBClean}
gsutil cp \
    gs://mdl-sc-isoform-2025-ms/sequencing_data/3dg_masseq/m84252_240709_213425_s4.hifi_reads.bcM0004.bam \
    data/3dg_masseq/raw/3dGNoKBClean/
gsutil cp \
    gs://mdl-sc-isoform-2025-ms/sequencing_data/3dg_masseq/m84252_240709_213425_s4.hifi_reads.bcM0002.bam \
    data/3dg_masseq/raw/ControlNoKBClean/
gsutil cp \
    gs://mdl-sc-isoform-2025-ms/sequencing_data/3dg_masseq/m84252_240709_213425_s4.hifi_reads.bcM0003.bam \
    data/3dg_masseq/raw/ControlWithKBClean/
```

#### Processed data

The stats from the skera output are in [`gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/3dg_skera_stats.pickle`](https://console.cloud.google.com/storage/browser/mdl-sc-isoform-2025-ms/notebook_checkpoints/3dg_skera_stats.pickle). If you have downloaded this file, you can run the next cell and skip to **QC Plots**.

In [None]:
raw_bams = list((mas_data_path / "raw").glob("*/*bam"))

sample_names = ["ControlNoKBClean", "ControlWithKBClean", "3dGNoKBClean"]
samples = {rb.stem: rb.parent.name for rb in raw_bams}

skera_stats_file = data_path / "3dg_skera_stats.pickle"

# uncomment to download
# ! gcloud storage cp gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/3dg_skera_stats.pickle  {data_path}/

samples

### Running skera

In [None]:
with open(sh_dir / f"{today}_skera.sh", "w") as out:
    for rb in raw_bams:
        # write into skera_path (created above), where the QC cell looks for output
        skera_bam = skera_path / f"{samples[rb.stem]}.skera.bam"
        cmd = f"skera split -j 16 {rb} metadata/mas16_primers.fasta {skera_bam}"

        # echo the command so we see progress
        print(f"echo {cmd}", file=out)
        print(cmd, file=out)

## QC Plots
After running skera, we read its ligation and read-length outputs and plot them. If `3dg_skera_stats.pickle` is present, the results are loaded from it instead.

In [None]:
if skera_stats_file.exists():
    with open(skera_stats_file, "rb") as fh:
        counts, read_lengths = pickle.load(fh)
else:
    skera_ligations = sorted(skera_path.glob("*ligations.csv"))
    skera_read_len = sorted(skera_path.glob("*read_lengths.csv"))

    # ligation counts (pairs of adapters)
    counts = {
        skl.name.split(".")[0]: read_ligation_csv(skl, 16) 
        for skl in skera_ligations
    }

    # getting read lengths--we don't need s-read lengths here
    read_lengths = {
        srl.name.split(".")[0]: read_length_csv(srl)[0]
        for srl in skera_read_len
    }

    read_lengths = {k: np.array(read_lengths[k]) for k in sample_names}

    with skera_stats_file.open("wb") as out:
        pickle.dump((counts, read_lengths), out)
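
The dictionary keys above come from `name.split(".")[0]`, which keeps everything before the first dot in the file name. A short sketch, assuming the skera outputs are named after the output BAM prefix (for example `ControlNoKBClean.skera.ligations.csv`); `sample_key` is a hypothetical helper and the file names are illustrative:

```python
from pathlib import Path


def sample_key(path):
    """Return the part of the file name before the first dot."""
    return Path(path).name.split(".")[0]


# Illustrative file names, assumed to follow the output BAM prefix
examples = [
    "skera_out/ControlNoKBClean.skera.ligations.csv",
    "skera_out/3dGNoKBClean.skera.read_lengths.csv",
]
keys = [sample_key(p) for p in examples]
print(keys)
```

This only recovers the sample name when the name itself contains no dots, which holds for the three sample names used here.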

In [None]:
pct_full = {k: (read_lengths[k][:,1] == 16).mean() for k in read_lengths}
for k in sample_names:
    print(f"{k:18}\t{pct_full[k]:.2%}")
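
The expression `(read_lengths[k][:,1] == 16).mean()` is the fraction of reads whose second column equals 16, the full array size. The same calculation in plain Python, assuming each row is a pair of (read length, number of segments) as the column index suggests; the rows below are hypothetical, not real data:

```python
# Hypothetical rows of (read length, number of segments); NOT real data
rows = [(12000, 16), (11500, 16), (6000, 8), (700, 1)]

ARRAY_SIZE = 16  # a full-length array has 16 segments

# Booleans sum as 0/1, so the mean of the mask is the fraction of full arrays
is_full = [n_segments == ARRAY_SIZE for _, n_segments in rows]
pct_full_demo = sum(is_full) / len(is_full)

print(f"{pct_full_demo:.2%}")  # prints 50.00%
```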

In [None]:
for k in sample_names:
    fig = plot_concat_and_ligations(read_lengths[k], counts[k], True)
    fig.suptitle(k)
    plt.savefig(figure_path / f"supp_fig9_{k}.svg")
    plt.show()

In [None]:
# this is an alternate view of the concatenation/length histogram, collapsing the length
# distribution to better show the proportion of array sizes
fig = plot_stacked_concat(
    read_lengths,
    normalize=True,
    labels=["Control\nNo clean", "Control\nw/ clean", "3dG\nNo clean"]
)
plt.savefig(figure_path / "fig1g_3dg_stacked.svg")
plt.show()