This workbook shows how to use scanpy to classify cell state from Teton cytoprofiling data. Along the way, it also shows how to
* load and filter data 
* convert data into the AnnData format
* perform basic processing of the data with scanpy
* generate and plot a UMAP projection with scanpy

To use this workbook, you need
* The AWS CLI available on your PATH
* The following Python dependencies installed: cytoprofiling, scanpy, anndata, numpy, pandas
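
The prerequisites above can be checked before running the rest of the workbook. The following is a minimal sketch using only the Python standard library. It assumes that each dependency is importable under the same name as listed, and the helper names `missing_python_packages` and `aws_cli_available` are illustrative, not part of any of these packages.

```python
import importlib.util
import shutil

# Dependencies listed above; assumed to be importable under these names.
REQUIRED_PACKAGES = ["cytoprofiling", "scanpy", "anndata", "numpy", "pandas"]


def missing_python_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def aws_cli_available():
    """Return True if an executable named 'aws' is found on the PATH."""
    return shutil.which("aws") is not None


print("AWS CLI on PATH:", aws_cli_available())
print("Missing packages:", missing_python_packages(REQUIRED_PACKAGES))
```

An empty list of missing packages and `True` for the AWS CLI indicate that the environment is likely ready.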

In the cell below, set the location of the data and the wells to include in the analysis. The defaults point to a publicly available demonstration dataset.

In [None]:
run_directory = "s3://element-public-data/cytoprofiling/ACE-3376"
wells = ["A2",]
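
To analyze a different run, it can help to derive every input path from `run_directory` in one place. The sketch below does this with a hypothetical helper, `run_paths`. The relative locations of `Panel.json` and `RawCellStats.parquet` are taken from the default dataset used in this workbook; that other runs follow the same layout is an assumption.

```python
def run_paths(run_directory):
    """Build the input file locations used in this workbook from a run directory.

    The relative paths mirror the default demonstration dataset; other runs
    are assumed, not guaranteed, to share this layout.
    """
    base = run_directory.rstrip("/")  # tolerate a trailing slash
    return {
        "panel": f"{base}/Panel.json",
        "raw_cell_stats": f"{base}/Cytoprofiling/Instrument/RawCellStats.parquet",
    }


paths = run_paths("s3://element-public-data/cytoprofiling/ACE-3376")
print(paths["panel"])
print(paths["raw_cell_stats"])
```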

Load the data from S3. Default cell filtering is performed with the cytoprofiling package.

In [None]:
import pandas as pd
import json
from tempfile import TemporaryDirectory
import cytoprofiling
import subprocess

# If credentials are already defined in the environment, they
# may need to be disabled to make unauthenticated requests to the
# public dataset in S3
#
# import os
# if "AWS_ACCESS_KEY_ID" in os.environ:
#     del os.environ["AWS_ACCESS_KEY_ID"]

panel_file = f"{run_directory}/Panel.json"
with TemporaryDirectory() as temp_dir:
    # download the panel file from S3 and load it
    subprocess.run(["aws", "s3", "cp", panel_file, temp_dir], check=True)
    with open(f"{temp_dir}/Panel.json", "rb") as panel_handle:
        panel_json = json.load(panel_handle)

# read the raw per-cell statistics from the selected run directory
df = pd.read_parquet(f"{run_directory}/Cytoprofiling/Instrument/RawCellStats.parquet")
df = df[df["Well"].isin(wells)]
df = cytoprofiling.filter_cells(df)
df = cytoprofiling.normalize_cytoprofiling(df)
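
As an alternative to removing credentials from the environment, the AWS CLI provides the global option `--no-sign-request`, which sends requests without signing them. The sketch below only builds the command so that it can be inspected first; `public_s3_copy_command` is a hypothetical helper, and the call that actually runs the command is left commented out.

```python
import subprocess


def public_s3_copy_command(source, destination):
    """Build an 'aws s3 cp' command that skips request signing."""
    return ["aws", "s3", "cp", "--no-sign-request", source, destination]


command = public_s3_copy_command(
    "s3://element-public-data/cytoprofiling/ACE-3376/Panel.json", "."
)
print(" ".join(command))

# Uncomment to run the download (requires the AWS CLI on the PATH):
# subprocess.run(command, check=True)
```

For the parquet file, `pd.read_parquet` accepts a `storage_options` argument that is forwarded to the filesystem backend. With s3fs, passing `storage_options={"anon": True}` should request anonymous access, although this has not been verified against this dataset.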

Process the data with scanpy to assign a cell phase to each cell and display the result in a UMAP projection.

In [None]:

import scanpy as sc
import numpy as np

import cytoprofiling

# Convert dataframe to anndata
adata = cytoprofiling.cytoprofiling_to_anndata(df, panel_json)

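# filter data columns to only include assigned, non-nuclear RNA counts
rna_columns = (
    (~adata.var["is_unassigned"])
    & (~adata.var["is_nuclear"])
    & np.isin(adata.var["measurement_type"], ["RNA"])
)
# copy so that later assignments modify a real object rather than a view
adata = adata[:, rna_columns].copy()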

# convert column names to gene names and remove any resulting duplicates 
adata.var_names = adata.var["gene"]
adata = adata[:, ~adata.var_names.duplicated()].copy()

# do processing of data to prepare for UMAP and cell cycle determination
n_comps = 10
sc.pp.log1p(adata)
sc.tl.pca(adata, n_comps=n_comps)
sc.pp.neighbors(adata, n_pcs=n_comps)

# assign cell phase
cytoprofiling.assign_cell_phase(adata)

# calculate UMAP
sc.tl.umap(adata)

# plot UMAP with calculated cell phase
sc.pl.umap(adata, color="phase")
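
To illustrate what the `log1p` and PCA steps in the cell above do to the count matrix, the following is a simplified numpy sketch on synthetic data. It is not scanpy's implementation: it uses a plain singular value decomposition, and the function name `log1p_pca` and the toy Poisson counts are assumptions made for illustration only.

```python
import numpy as np


def log1p_pca(counts, n_comps):
    """Log-transform counts, then project cells onto the top principal components."""
    x = np.log1p(np.asarray(counts, dtype=float))  # log(1 + count) per entry
    x = x - x.mean(axis=0)                         # center each feature
    # SVD of the centered matrix; u * s gives the principal component scores
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_comps] * s[:n_comps]


# Synthetic counts: 50 cells by 20 features (not real cytoprofiling data)
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=5.0, size=(50, 20))

scores = log1p_pca(toy_counts, n_comps=10)
print(scores.shape)
```

Each row of `scores` is one cell described by 10 components, ordered from the most to the least variance explained. In the workbook, the neighbor graph and the UMAP projection are computed from this kind of reduced representation.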