# Summary

This notebook generates a variety of summary plots for pre-cultured cytoprofiling flowcells provided by Element Biosciences. These flowcells are seeded with either HeLa or A549 cells in alternating wells.

## HeLa (Human Cervical Cancer)
* Derived from cervical adenocarcinoma 
* Highly proliferative and genetically unstable
* Commonly used in studies of cell cycle, DNA repair, and signaling
* Strong NF-κB and p53 responses to stress stimuli

## A549 (Human Lung Carcinoma)
* Derived from alveolar basal epithelial cells (lung adenocarcinoma)
* Moderately proliferative, more epithelial in morphology
* Frequently used in respiratory research and drug testing
* Known for robust MAPK/ERK signaling and oxidative stress responses

## Comparison
* Both are adherent epithelial cancer lines but differ in tissue origin (cervix vs. lung)
* Response profiles to cytokines and inhibitors can diverge, making them complementary models for signaling studies


# Using the notebook

## Prerequisites

The following Python packages must be installed:
* matplotlib
* numpy
* pandas
* scanpy
* seaborn
* cytoprofiling (see instructions at https://github.com/Elembio/cytoprofiling)


## Executing the notebook
Before executing the notebook, download the three files listed below from the run directory for your run and ensure the paths below point to their locations on the local system.

| File | Path in Run Directory |
|-|-|
| RunStats.json | Cytoprofiling/Instrument/RunStats.json |
| RawCellStats.parquet | Cytoprofiling/Instrument/RawCellStats.parquet |
| Panel.json | Panel.json |

In [None]:
run_stats_json = "RunStats.json"
panel_json = "Panel.json"
raw_cell_stats_parquet = "RawCellStats.parquet"
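
Checking that the three paths resolve before running the rest of the notebook can avoid a confusing failure further down. The sketch below is an optional addition, not part of the original workflow, and the helper name `find_missing_files` is hypothetical.

```python
from pathlib import Path

def find_missing_files(paths):
    """Return the subset of paths that do not exist as regular files (hypothetical helper)."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Same default names as the path cell above; adjust to your local copies.
input_files = ["RunStats.json", "Panel.json", "RawCellStats.parquet"]
missing = find_missing_files(input_files)
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```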

Define plotting functions

In [None]:
from matplotlib import pyplot as plt
import numpy as np

def plot_per_batch_and_well(wells, batch2well_values, label, run_name):
    batch_names = sorted(list(batch2well_values.keys()))

    width = 1 / (len(batch_names) + 1)

    plt.figure(figsize=(10,6))
    for batch_idx, batch in enumerate(batch_names):
        values = [float("nan") if abs(float(val) + 999) < 0.0001 else float(val) for val in batch2well_values[batch]]
        plt.bar([pos + batch_idx * width for pos in range(len(wells))], values, width = width, label=batch)

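    # center each tick under its group of bars (bars sit at pos, pos + width, ..., pos + (n - 1) * width)
    plt.xticks(np.arange(len(wells)) + width * (len(batch_names) - 1) / 2, [f"{well}" for well in wells])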
    plt.xticks(rotation=-45, ha="left")
    plt.ylabel(label)
    plt.xlabel("Well")
    plt.title(run_name)
    plt.legend()
    plt.tight_layout()
    plt.show()
    plt.close()

def plot_per_well(wells, values, label, run_name):
    plt.figure(figsize=(10,6))
    plt.bar(wells, values)
    
    plt.xticks(range(len(wells)), [f"{well}" for well in wells]) 
    plt.xticks(rotation=-45, ha="left")
    plt.ylabel(label)
    plt.xlabel("Well")
    plt.title(run_name)
    plt.tight_layout()
    plt.show()
    plt.close()

Load the JSON files (run stats, panel) into Python dictionaries

In [None]:
import json

with open(run_stats_json) as f:
    run_stats = json.load(f)

with open(panel_json) as f:
    panel = json.load(f)

wells = sorted([well_data["WellLocation"] for well_data in run_stats["CytoStats"]["Wells"]])
batches = sorted([batch_data["BatchName"] for batch_data in run_stats["DemuxStats"]["Batches"]])

# Barcoding performance metrics

Groups of barcodes are sequenced in serial batches, where each batch is defined by a specific sequencing primer. The plots below detail metrics assessing the barcoding performance in each batch for each well in the flowcell. 

| Metric | Description | Expected Value |
|-|-|-|
| PercentAssignedReads | Of all polonies, percentage assigned to an expected barcode | > 70% |
| PercentMismatch | Of all polonies assigned to a barcode, percentage assigned with a mismatch | < 35% |


In [None]:
for demux_stat in ["PercentAssignedReads", "PercentMismatch"]:
    values = {}
    for batch_data in run_stats["DemuxStats"]["Batches"]:
        batch_values = []
        for well in wells:
            well_value = float("nan")
            well_data = [well_data for well_data in batch_data["Wells"] if well_data["WellLocation"] == well]
            if len(well_data) == 1:
                well_value = well_data[0][demux_stat]
                if abs(float(well_value) + 999) < 0.0001:
                    well_value = float("nan")
            batch_values.append(well_value)
        values[batch_data["BatchName"]] = batch_values
    plot_per_batch_and_well(wells, values, demux_stat, run_stats["RunName"])
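
The cell above treats a value of -999 as a placeholder for a missing metric and converts it to NaN so that no bar is drawn. That reading of -999 is inferred from the checks in this notebook rather than from a documented specification. A minimal sketch of the same check as a reusable helper (the name `sentinel_to_nan` is hypothetical):

```python
import math

MISSING_SENTINEL = -999  # assumption: inferred from the checks in this notebook

def sentinel_to_nan(value, sentinel=MISSING_SENTINEL, tol=1e-4):
    """Return NaN when value matches the sentinel within tol, else the value as a float."""
    value = float(value)
    return float("nan") if abs(value - sentinel) < tol else value

print(sentinel_to_nan(-999))   # nan
print(sentinel_to_nan(82.5))   # 82.5
```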

# Cell segmentation performance metrics

Cell segmentation is performed based on the cell paint images for the cell membrane, nucleus and actin. The metrics below summarize the results of the segmentation process for each well in the flowcell. 

| Metric | Description | Expected Value |
|-|-|-|
| PercentConfluency | Percentage of well area occupied by cells | 25-50% (variable based on cell seeding) |
| CellCount | Number of objects detected during segmentation | > 10,000 (variable based on cell seeding) |
| MedianCellDiameter | Approximate median diameter of cells in microns | ~35 µm |
| PercentNucleatedCells | Percentage of cells with a segmented nucleus | > 97% |

In [None]:
for segmentation_metric in ["PercentConfluency", "CellCount", "MedianCellDiameter", "PercentNucleatedCells"]:
    values = []
    for well in wells:
        well_value = float("nan")
        well_data = [well_data for well_data in run_stats["CytoStats"]["Wells"] if well_data["WellLocation"] == well]
        if len(well_data) == 1:
            well_value = float("nan") if abs(well_data[0][segmentation_metric]+999) < 0.0001 else well_data[0][segmentation_metric]
        values.append(well_value)
    plot_per_well(wells, values, segmentation_metric, run_stats["RunName"])
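
The expected values in the table above can also be checked programmatically. The sketch below flags wells that fall outside those ranges. The helper name `flag_wells` and the example dictionaries are hypothetical, the thresholds are copied from the table and treated as inclusive bounds, and MedianCellDiameter is left out because the table gives only an approximate value.

```python
import math

# Expected ranges copied from the table above as (lower bound, upper bound); None means unbounded.
EXPECTED_RANGES = {
    "PercentConfluency": (25, 50),
    "CellCount": (10_000, None),
    "PercentNucleatedCells": (97, None),
}

def flag_wells(well_stats, expected=EXPECTED_RANGES):
    """Return (well, metric, value) tuples for metrics outside the expected range.

    well_stats is a list of dicts shaped like run_stats["CytoStats"]["Wells"].
    Missing metrics and the -999 placeholder are skipped rather than flagged.
    """
    flagged = []
    for well in well_stats:
        for metric, (low, high) in expected.items():
            value = well.get(metric)
            if value is None or math.isnan(value) or abs(value + 999) < 1e-4:
                continue
            if (low is not None and value < low) or (high is not None and value > high):
                flagged.append((well["WellLocation"], metric, value))
    return flagged

# Made-up example values, for illustration only.
example_wells = [
    {"WellLocation": "A1", "PercentConfluency": 40.0, "CellCount": 15000, "PercentNucleatedCells": 98.5},
    {"WellLocation": "A2", "PercentConfluency": 12.0, "CellCount": 4000, "PercentNucleatedCells": 99.0},
]
for well, metric, value in flag_wells(example_wells):
    print(f"{well}: {metric} = {value} is outside the expected range")
```

With real data, the same call would be `flag_wells(run_stats["CytoStats"]["Wells"])`. Because confluency and cell count vary with seeding density, a flagged well is a prompt to look closer rather than a failure.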

# Cell assignment metrics

After barcoding and cell segmentation are complete, individual barcodes are assigned to cells. The metrics below summarize this process.

| Metric | Description | Expected Value |
|-|-|-|
| AssignedCountsPerMM2 | Number of assigned polonies per mm² of cell area | ~150,000 (protein), 200,000-300,000 (RNA) |

In [None]:
for cyto_stat in ["AssignedCountsPerMM2",]:
    values = {}
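    # iterate wells in the same sorted order as `wells` so the bars line up with the x-axis labels
    for well_data in sorted(run_stats["CytoStats"]["Wells"], key=lambda w: w["WellLocation"]):
        for batch_data in well_data["Batches"]:
            batch_name = batch_data["BatchName"]
            # skip batches whose name starts with "CP"
            if batch_name.startswith("CP"):
                continue
            # -999 marks a missing value; plot it as NaN
            batch_value = batch_data.get(cyto_stat, -999)
            if abs(float(batch_value) + 999) < 0.0001:
                batch_value = float("nan")
            values.setdefault(batch_name, []).append(batch_value)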
    plot_per_batch_and_well(wells, values, cyto_stat, run_stats["RunName"])

Load the raw cell stats into a data frame

In [None]:
import pandas as pd
from cytoprofiling import filter_cells

# use the path defined at the top of the notebook rather than a hard-coded file name
df = pd.read_parquet(raw_cell_stats_parquet)
df = filter_cells(df)

# Correlation metrics

For each pair of wells, we calculate the squared Pearson correlation (R²) of log-transformed average counts as a measure of reproducibility. For both RNA and protein data types, replicate wells should have R² > 0.95.

In [None]:
import numpy as np
import scanpy as sc
import seaborn as sns
from matplotlib import pyplot as plt
from cytoprofiling import cytoprofiling_to_anndata
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="anndata")

for data_type in ["RNA", "Protein"]:

    # Convert dataframe to anndata
    adata = cytoprofiling_to_anndata(df, panel)

    # filter data columns to only include simple counts for data_type
    adata = adata[:,(~adata.var["is_unassigned"]) & (~adata.var["is_nuclear"]) & np.isin(adata.var["measurement_type"], [data_type,])]


    wells_x = adata.obs["Well"].unique()


    # average anndata object adata by well observation
    grouped_adata = sc.get.aggregate(adata, func = 'mean', by = "Well")

    well2label_x = {}
    for well in wells_x:
        well2label_x[well] = df["WellLabel"][df["Well"] == well].unique()[0]

    # pairwise R^2 of log2-transformed mean counts between wells
    distances_2d = []
    for well_x in wells_x:
        current_distances = []
        log_mean_x = np.log2(grouped_adata[grouped_adata.obs["Well"] == well_x].layers["mean"].flatten())
        for well_y in wells_x:
            log_mean_y = np.log2(grouped_adata[grouped_adata.obs["Well"] == well_y].layers["mean"].flatten())
            r2 = np.corrcoef(log_mean_x, log_mean_y)[0][1] ** 2
            current_distances.append(r2)
        distances_2d.append(current_distances)

    sns.heatmap(distances_2d, annot=False, cmap='coolwarm', yticklabels=wells_x, xticklabels= [f"{well} {well2label_x[well]}" for well in wells_x])

    plt.title(f"{run_stats['RunName']} well correlation - {data_type}")
    plt.show()
    plt.close()
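
To make the metric concrete, the sketch below computes the same squared Pearson correlation of log2-transformed mean counts for synthetic wells. The counts are made up for illustration, and the helper name `log_r2` is hypothetical.

```python
import numpy as np

def log_r2(mean_counts_x, mean_counts_y):
    """Squared Pearson correlation of log2-transformed per-target mean counts."""
    log_x = np.log2(np.asarray(mean_counts_x, dtype=float))
    log_y = np.log2(np.asarray(mean_counts_y, dtype=float))
    return np.corrcoef(log_x, log_y)[0, 1] ** 2

# Synthetic per-target means for three wells (illustrative values only).
well_a = np.array([1.0, 4.0, 16.0, 64.0, 256.0])
well_b = 2.0 * well_a                              # same profile at twice the depth
well_c = np.array([256.0, 1.0, 64.0, 4.0, 16.0])   # same values, shuffled across targets

print(log_r2(well_a, well_b))  # close to 1: a constant scale factor only shifts the log2 values
print(log_r2(well_a, well_c))  # well below 1: the per-target profiles disagree
```

One caveat that applies to any log-based correlation: a target whose mean count is exactly zero gives log2(0) = -inf, and the resulting correlation is NaN.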

# UMAP projection and cell state 

The single-cell data can be used to generate a UMAP projection. Across the entire flowcell, the UMAP projection should separate the data into two well-defined clusters, one for each cell type. Within each cluster, a circular pattern should exist, corresponding to the assigned cell phase.

In [None]:
import scanpy as sc
import numpy as np
from cytoprofiling import cytoprofiling_to_anndata, assign_cell_phase
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="anndata")

# Convert dataframe to anndata
adata = cytoprofiling_to_anndata(df, panel)

# filter data columns to only include simple counts for protein and RNA
adata = adata[:,(~adata.var["is_unassigned"]) & (~adata.var["is_nuclear"]) & np.isin(adata.var["measurement_type"], ["RNA", "Protein"])]

# convert column names to gene names and remove any resulting duplicates 
adata.var_names = adata.var["gene"]
adata = adata[:, ~adata.var_names.duplicated()].copy()

# do processing of data to prepare for UMAP and cell cycle determination
n_comps = 10
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.tl.pca(adata, n_comps=n_comps)
sc.pp.neighbors(adata, n_pcs=n_comps)

# assign cell phase
assign_cell_phase(adata)

# calculate UMAP
sc.tl.umap(adata)

# plot UMAP with well label
sc.pl.umap(adata, color="WellLabel")

# plot UMAP with calculated cell phase
sc.pl.umap(adata, color="phase")
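
Beyond the visual check, the share of cells in each phase per well label can be tabulated. The sketch below uses a small made-up table in place of `adata.obs`. The column names `WellLabel` and `phase` are the ones used in the plots above; the label values and the phase names G1, S and G2M are assumptions about what the data and `assign_cell_phase` contain.

```python
import pandas as pd

# Stand-in for adata.obs (made-up rows, illustration only).
obs = pd.DataFrame({
    "WellLabel": ["HeLa", "HeLa", "HeLa", "HeLa", "A549", "A549", "A549", "A549"],
    "phase":     ["G1",   "G1",   "S",    "G2M",  "G1",   "S",    "S",    "G2M"],
})

# Rows are well labels, columns are phases, values are the fraction of cells within each row.
phase_fractions = pd.crosstab(obs["WellLabel"], obs["phase"], normalize="index")
print(phase_fractions)
```

With real data, replace `obs` with `adata.obs` after the cell above has run.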