## Barnyard Analysis

Stats and figures for the barnyard analyses. We performed a human+mouse (K562 + 3T3) Fluent experiment in-house and are comparing the results to public data from 10x Genomics (HEK293T + 3T3).

The Fluent data was saved as an `npz` file in the previous notebook.

The 10x data files (for example `10k_hgmm_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.h5` for the 3' dataset) can be downloaded from the 10x Genomics site: [10x 3' data](https://www.10xgenomics.com/resources/datasets/10-k-1-1-mixture-of-human-hek-293-t-and-mouse-nih-3-t-3-cells-3-v-3-1-chromium-controller-3-1-standard-6-1-0), [10x 5' data](https://www.10xgenomics.com/datasets/10-k-1-1-mixture-of-human-hek-293-t-and-mouse-nih-3-t-3-cells-5-v-2-0-chromium-controller-2-standard-6-1-0)

We are using the file labeled `Feature / cell matrix HDF5 (raw)` which contains all the assigned barcodes.

#### Figure data

To regenerate the figures without processing these files at all, download and load the `barnyard_stats.pickle` file from `gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/barnyard_stats.pickle` (size is about 100M). It contains the arrays needed to make the ambient contamination and barnyard plots at the bottom of this notebook.


In [None]:
import csv
import gzip
import pickle
from collections import Counter
from pathlib import Path

import sparse

import matplotlib.pyplot as plt

# installed from www.github.com/methodsdev/isoscelles
from mdl.isoscelles.io import read_10x_h5

from mdl.sc_isoform_paper.constants import SAMPLE_COLORS
from mdl.sc_isoform_paper.plots import plot_ambient_distributions, plot_barn

### Loading data

We load the raw data as before, caching the 10x data as `npz` files as we do so. We don't actually care about the cell barcodes here; we just need to know which features are human and which are mouse.

We have a consistent feature list because we used the same combined reference for analysis. The features are in the same order, but the Fluent file has columns `gene_id,gene_name,feature_type` while the 10x file has columns `gene_name,gene_id` (the first two in reverse order).
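
As a small illustration of the column-order difference, the sketch below compares two toy feature lists by reversing each tuple. The gene IDs and names are made-up placeholders, not entries from the real reference.

```python
# Toy feature lists; the IDs and names are illustrative placeholders.
# Fluent-style order: (gene_id, gene_name)
fluent_features = [
    ("GRCh38_id_A", "GRCh38_name_A"),
    ("mm10_id_B", "mm10_name_B"),
]
# 10x-style order: (gene_name, gene_id)
tenx_features = [
    ("GRCh38_name_A", "GRCh38_id_A"),
    ("mm10_name_B", "mm10_id_B"),
]

# Reversing each 10x tuple puts it in (gene_id, gene_name) order
tenx_reordered = [t[::-1] for t in tenx_features]
same_features = fluent_features == tenx_reordered
print("same features after reversing:", same_features)
```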

In [None]:
root_dir = Path.home()
data_path = root_dir / "data"
data_path_pipseq = data_path / "pipseq_barnyard"
data_path_10x = data_path / "10x_barnyard"

figure_path = root_dir / "202501_figures"

In [None]:
# remove `echo` to create a directory for the 10x data, and download it
! echo mkdir {data_path_10x}
! echo wget -P {data_path_10x} \
    https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_hgmm_3p_nextgem_Chromium_Controller/10k_hgmm_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.h5 
! echo wget -P {data_path_10x} \
    https://cf.10xgenomics.com/samples/cell-vdj/6.1.0/10k_hgmm_5pv2_nextgem_Chromium_Controller/10k_hgmm_5pv2_nextgem_Chromium_Controller_raw_feature_bc_matrix.h5

In [None]:
# remove `echo` to download
! echo gcloud storage cp gs://mdl-sc-isoform-2025-ms/notebook_checkpoints/barnyard_stats.pickle {data_path}/

In [None]:
barnyard_stats = data_path / "barnyard_stats.pickle"
if barnyard_stats.exists():
    with barnyard_stats.open("rb") as fh:
        barn_h_10x_3p, barn_m_10x_3p, barn_h_10x_5p, barn_m_10x_5p, barn_h_pipseq, barn_m_pipseq = pickle.load(fh)

### Reading and processing data

If data was loaded from the pickle, these cells can be skipped and you can jump to the **Ambient Contamination** section.

In [None]:
# generated in notebook 01
barn_pipseq = sparse.load_npz(data_path_pipseq / "raw_matrix" / "pipseq_barnyard.npz")

with gzip.open(data_path_pipseq / "raw_matrix" / "features.tsv.gz", "rt") as fh:
    barn_g = [tuple(r[:2]) for r in csv.reader(fh, delimiter="\t")]

In [None]:
npz_10x_3p_file = data_path_10x / "10k_hgmm_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.npz"
if not npz_10x_3p_file.exists():
    barn_10x_3p, barn_10x_3p_bc, barn_10x_3p_g = read_10x_h5(npz_10x_3p_file.with_suffix(".h5"))

    sparse.save_npz(npz_10x_3p_file, barn_10x_3p)
    
    with gzip.open(data_path_10x / "3p_features.tsv.gz", "wt") as out:
        print("\n".join("\t".join(g) for g in barn_10x_3p_g), file=out)
else:
    barn_10x_3p = sparse.load_npz(npz_10x_3p_file)

    with gzip.open(data_path_10x / "3p_features.tsv.gz", "rt") as fh:
        barn_10x_3p_g = [tuple(r[:2]) for r in csv.reader(fh, delimiter="\t")]

print("feature list is equal but elements are reversed:", barn_g == [t[::-1] for t in barn_10x_3p_g])

In [None]:
npz_10x_5p_file = data_path_10x / "10k_hgmm_5pv2_nextgem_Chromium_Controller_raw_feature_bc_matrix.npz"
if not npz_10x_5p_file.exists():
    barn_10x_5p, barn_10x_5p_bc, barn_10x_5p_g = read_10x_h5(npz_10x_5p_file.with_suffix(".h5"))

    sparse.save_npz(npz_10x_5p_file, barn_10x_5p)
    
    with gzip.open(data_path_10x / "5p_features.tsv.gz", "wt") as out:
        print("\n".join("\t".join(g) for g in barn_10x_5p_g), file=out)
else:
    barn_10x_5p = sparse.load_npz(npz_10x_5p_file)

    with gzip.open(data_path_10x / "5p_features.tsv.gz", "rt") as fh:
        barn_10x_5p_g = [tuple(r[:2]) for r in csv.reader(fh, delimiter="\t")]

print("feature list is equal but elements are reversed:", barn_g == [t[::-1] for t in barn_10x_5p_g])

#### Splitting human and mouse genes

We need to separate out the UMIs for each species. The human genes are listed first in the matrix, then the mouse.
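
A minimal sketch of how the split point can be checked from the feature IDs rather than assumed, using a toy feature list. It relies on the same `GRCh38` and `mm10` ID prefixes that the notebook checks for; the gene IDs themselves are placeholders.

```python
# Toy (gene_id, gene_name) list; real IDs are replaced by placeholders
features = [
    ("GRCh38_g1", "GRCh38_A"),
    ("GRCh38_g2", "GRCh38_B"),
    ("GRCh38_g3", "GRCh38_C"),
    ("mm10_g4", "mm10_D"),
    ("mm10_g5", "mm10_E"),
]

is_human = [gene_id.startswith("GRCh38") for gene_id, _ in features]
n_human = sum(is_human)

# Human features form one contiguous block at the start if the first
# n_human flags are all True and the remaining flags are all False
contiguous = all(is_human[:n_human]) and not any(is_human[n_human:])

# Columns [:n_human] are then human and columns [n_human:] are mouse
print(n_human, contiguous)
```

If `contiguous` holds, `n_human` is the column index that separates the two species.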

In [None]:
print(Counter(g[0].split("_")[0] for g in barn_g))

print(
    "Index of last human feature: ",
    max(i for i,g in enumerate(barn_g) if g[0].startswith("GRCh38")),
    "\n"
    "Index of first mouse feature: ",
    min(i for i,g in enumerate(barn_g) if g[0].startswith("mm10"))
)

In [None]:
barn_numis_pipseq = barn_pipseq.sum(1).todense()
barn_numis_10x_3p = barn_10x_3p.sum(1).todense()
barn_numis_10x_5p = barn_10x_5p.sum(1).todense()

In [None]:
# kneeplot of nUMIs for each barnyard experiment
plt.plot(sorted(barn_numis_10x_3p, reverse=True), label="10x 3'", color=SAMPLE_COLORS["10x 3'"])
plt.plot(sorted(barn_numis_10x_5p, reverse=True), label="10x 5'", color=SAMPLE_COLORS["10x 5'"])
plt.plot(sorted(barn_numis_pipseq, reverse=True), label="PIPseq", color=SAMPLE_COLORS['PIPseq'])
plt.xscale("log")
plt.yscale("log")

plt.xlabel("Barcodes")
plt.ylabel("# UMIs")
plt.legend()
plt.show()

Here we're just looking at the knee plots (barcodes ranked by UMI count vs. # UMIs) for each barnyard experiment. The PIPseq data has ~3.6x more sequencing, but there are also clear differences in the number of cells sequenced (the first knee) and the number of barcodes with ambient reads (the second knee).
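
For reference, a sketch of what a knee plot encodes, on made-up counts: the y-values are the per-barcode UMI totals in descending order and the x-values are their ranks. The counts and the threshold are illustrative only.

```python
import numpy as np

# Made-up per-barcode UMI totals
numis = np.array([5, 1200, 3, 9800, 40, 7, 11000])

# Descending sort gives the y-values of a knee plot; x is the 1-based rank
ranked = np.sort(numis)[::-1]
rank = np.arange(1, len(ranked) + 1)

# Number of barcodes at or above an illustrative threshold
n_above = int((ranked >= 1000).sum())
print(ranked, n_above)
```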

In [None]:
# calculate nUMIs per-species for each experiment

barn_h_pipseq = barn_pipseq[:,:36601].sum(1).todense()
barn_m_pipseq = barn_pipseq[:,36601:].sum(1).todense()

barn_h_10x_3p = barn_10x_3p[:,:36601].sum(1).todense()
barn_m_10x_3p = barn_10x_3p[:,36601:].sum(1).todense()

barn_h_10x_5p = barn_10x_5p[:,:36601].sum(1).todense()
barn_m_10x_5p = barn_10x_5p[:,36601:].sum(1).todense()

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(18, 5))

titles = ["10x 3'", "10x 5'", "PIPseq"]
labels = ["Total", "Human", "Mouse"]

for i, ns in enumerate(
    (
        (barn_numis_10x_3p, barn_h_10x_3p, barn_m_10x_3p), 
        (barn_numis_10x_5p, barn_h_10x_5p, barn_m_10x_5p),
        (barn_numis_pipseq, barn_h_pipseq, barn_m_pipseq)
    )
):
    for n,lbl in zip(ns, labels):
        ax[i].plot(sorted(n, reverse=True), label=lbl)

    ax[i].axhline(1000, linestyle=":", color='grey')
    ax[i].axhline(10000, linestyle=":", color='grey')
    ax[i].set_xscale("log")
    ax[i].set_yscale("log")
    ax[i].set_xlabel("Barcodes")
    ax[i].set_ylabel("# UMIs")
    ax[i].set_title(titles[i])
    ax[i].legend()

plt.show()

Here we break up the experiments by species to visualize the number of human and mouse cells seen in each experiment. This just confirms that all the experiments contain a roughly equal mix of the two species. The two dotted lines (at 1,000 and 10,000 UMIs) show possible low and high cutoffs for confidently calling empty droplets and real cells, respectively.
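
A sketch of how a single cutoff turns per-species counts into cell and doublet calls, on made-up counts. A barcode is treated as a cell if either species exceeds the cutoff, and as a cross-species doublet if both do. The numbers and the cutoff are illustrative, not recommended values.

```python
import numpy as np

# Made-up per-barcode UMI counts for five barcodes
human = np.array([12000, 40, 9000, 15, 300])
mouse = np.array([150, 11000, 8000, 20, 100])
cutoff = 1000  # illustrative threshold

h_ix = human > cutoff   # human count passes the cutoff
m_ix = mouse > cutoff   # mouse count passes the cutoff
cells = h_ix | m_ix     # at least one species passes
doublets = h_ix & m_ix  # both species pass

doublet_rate = doublets.sum() / cells.sum()
print(f"{cells.sum()} cells, {doublets.sum()} doublets, rate {doublet_rate:.2%}")
```

Note that a barnyard experiment can only reveal doublets that contain one cell from each species; same-species doublets are not counted this way.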

In [None]:
# doublet rates for a variety of cutoffs
# columns: sample, cutoff, # human, # mouse, # both, # either,
#          both / (total UMIs > cutoff), both / either

for (name, hm_data, h_data, m_data) in (
    ("10x 3'", barn_numis_10x_3p, barn_h_10x_3p, barn_m_10x_3p),
    ("10x 5'", barn_numis_10x_5p, barn_h_10x_5p, barn_m_10x_5p),
    ("PIPseq", barn_numis_pipseq, barn_h_pipseq, barn_m_pipseq)
):
    for cutoff in (2750, 3000, 10000):
        c_ix = hm_data > cutoff  # total UMIs above cutoff
        h_ix = h_data > cutoff   # human UMIs above cutoff
        m_ix = m_data > cutoff   # mouse UMIs above cutoff
        ix = h_ix | m_ix         # either species above cutoff
        ix2 = h_ix & m_ix        # both species above cutoff (cross-species doublet)

        print(
            name,
            *(f"{v:,d}" for v in (cutoff, h_ix.sum(), m_ix.sum(), ix2.sum(), ix.sum())),
            f"{ix2.sum() / c_ix.sum():.2%}",
            f"{ix2.sum() / ix.sum():.2%}",
            sep="\t",
        )


### Ambient contamination

In [None]:
plot_ambient_distributions(
    ["PIPseq", "10x 3'", "10x 5'"],
    [barn_h_pipseq, barn_h_10x_3p, barn_h_10x_5p],
    [barn_m_pipseq, barn_m_10x_3p, barn_m_10x_5p],
    output_file=figure_path / "supp_fig5.svg"
)

This plot shows the percent of other-species UMIs in what we consider single-cell barcodes (i.e. they contain many UMIs and aren't doublets). For each such barcode we consider how many UMIs from the other species are seen, which gives us an estimate of the contamination from ambient mRNA. In both comparisons, PIPseq and 10x appear roughly similar in terms of contamination percentage.
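
A minimal sketch of this quantity on made-up counts, assuming a single cutoff defines "many UMIs". This is not the implementation of `plot_ambient_distributions`, which is not shown in this notebook; it only illustrates what is being plotted.

```python
import numpy as np

# Made-up per-barcode UMI counts for five barcodes
human = np.array([12000, 40, 9000, 15, 20000])
mouse = np.array([150, 11000, 8000, 20, 100])
cutoff = 1000  # illustrative threshold

h_ix = human > cutoff
m_ix = mouse > cutoff
human_only = h_ix & ~m_ix  # human cells that are not doublets
mouse_only = m_ix & ~h_ix  # mouse cells that are not doublets

# Percent of each single-species barcode's UMIs from the other species
total = human + mouse
pct_mouse_in_human = 100 * mouse[human_only] / total[human_only]
pct_human_in_mouse = 100 * human[mouse_only] / total[mouse_only]
print(pct_mouse_in_human.round(2), pct_human_in_mouse.round(2))
```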

### Barnyard plots

The following is the standard barnyard plot for the three datasets: a scatter plot of human UMIs vs mouse UMIs for each barcode. We show this on a log scale to illustrate that all of the called barcodes contain a notable number of UMIs from the other species in the experiment; this is difficult to observe with linear scaling.



In [None]:
fig, ax = plt.subplots(1, 3, figsize=(18, 5))

plot_barn(fig, ax[0], "10x 3'", barn_h_10x_3p, barn_m_10x_3p, 2750)
plot_barn(fig, ax[1], "10x 5'", barn_h_10x_5p, barn_m_10x_5p, 3000)
plot_barn(fig, ax[2], "PIPseq", barn_h_pipseq, barn_m_pipseq, 10000)

fig.suptitle("Supplementary figure: barnyard experiments")

plt.savefig(figure_path / "supp_fig4.svg")
plt.show()

### Saving data for figure 1

We just save the data for the contamination and barnyard plots shown above, so we can reproduce them without re-running all of this.
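
Because the arrays are stored as a plain tuple, they must be unpacked in the same order they were written. A small sketch of that round trip, using an in-memory buffer and placeholder lists in place of the real arrays and file:

```python
import io
import pickle

# Placeholder values standing in for the six per-barcode count arrays
stats = ([1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12])

buffer = io.BytesIO()  # in-memory stand-in for the pickle file
pickle.dump(stats, buffer)
buffer.seek(0)

# Unpack in the same order the tuple was written
h_3p, m_3p, h_5p, m_5p, h_pip, m_pip = pickle.load(buffer)
print(h_3p, m_pip)
```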

In [None]:
if not barnyard_stats.exists():
    with open(barnyard_stats, "wb") as out:
        pickle.dump(
            (barn_h_10x_3p, barn_m_10x_3p, barn_h_10x_5p, barn_m_10x_5p, barn_h_pipseq, barn_m_pipseq),
            out
        )